I want to create a dataframe that shows the sequence of each user's purchases according to the sequence column. For example, this is my current df:
user_id | sequence | product | price
1 | 1 | A | 10
1 | 2 | C | 15
1 | 3 | G | 1
2 | 1 | B | 20
2 | 2 | T | 45
2 | 3 | A | 10
...
I want to convert it to the following format:
user_id | source_product | target_product | cum_total_price
1 | A | C | 25
1 | C | G | 26
2 | B | T | 65
2 | T | A | 75
...
How can I achieve this?
shift + cumsum + groupby.apply:

def seq(g):
    g['source_product'] = g['product']
    g['target_product'] = g['product'].shift(-1)
    # running total, shifted so each row shows the total through its target row
    g['price'] = g['price'].cumsum().shift(-1)
    return g[['user_id', 'source_product', 'target_product', 'price']].iloc[:-1]

df.sort_values('sequence').groupby('user_id', group_keys=False).apply(seq)
# user_id source_product target_product price
#0 1 A C 25.0
#1 1 C G 26.0
#3 2 B T 65.0
#4 2 T A 75.0
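If groupby.apply is slow on a larger frame, the same result can be built with fully vectorized groupby operations. A sketch, reconstructing the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'sequence': [1, 2, 3, 1, 2, 3],
    'product': ['A', 'C', 'G', 'B', 'T', 'A'],
    'price': [10, 15, 1, 20, 45, 10],
})

df = df.sort_values(['user_id', 'sequence'])
g = df.groupby('user_id')
out = pd.DataFrame({
    'user_id': df['user_id'],
    'source_product': df['product'],
    # next product within the same user
    'target_product': g['product'].shift(-1),
    # running total, shifted so each row shows the total through its target row
    'cum_total_price': g['price'].cumsum().groupby(df['user_id']).shift(-1),
}).dropna(subset=['target_product'])  # drop each user's last row (no target)
```

This avoids calling a Python function once per group, which matters when there are many users.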
I am trying to locate duplicate rows in my pandas dataframe. In reality, df.shape is (438796, 4531), but I am using the toy example below as an MRE:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low |
| id_104 | 1 | 1 | 10 | 1 | 1 | High |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low |
| id_106 | 0 | 0 | 0 | 0 | 0 | High |
| id_107 | 1 | 1 | 6 | 0 | 1 | High |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium |
| id_110 | 0 | 1 | 32 | 0 | 1 | High |
What I am trying to accomplish is to look at a subset of the features and, if there are duplicate rows, keep the first and then record which id: label pairs are its duplicates.
I have looked at the following posts:
find duplicate rows in a pandas dataframe
(I could not figure out how to replace col1 in df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin') with my list of cols)
Find all duplicate rows in a pandas dataframe
I know pandas has a duplicated() method, so I tried implementing that, and it sort of works:
import pandas as pd
# Read in example data
df = pd.read_clipboard()
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()
# Loop through the duplicates and print out the values I want
for idx in dupes:
    # print(df[:idx])
    print(df.loc[[idx], ['id', 'label']])
However, what I am trying to do is, for a particular row, determine which rows are duplicates of it by saving those rows as id: label combinations. So while I'm able to extract the id and label for each duplicate, I have no way to map it back to the original row it duplicates.
An ideal dataset would look like:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label | duplicates |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium | {id_107: High} |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low | |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low | |
| id_104 | 1 | 1 | 10 | 1 | 1 | High | |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low | {id_110: High} |
| id_106 | 0 | 0 | 0 | 0 | 0 | High | |
| id_107 | 1 | 1 | 6 | 0 | 1 | High | |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium | |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium | |
| id_110 | 0 | 1 | 32 | 0 | 1 | High | |
How can I take my duplicated values and map them back to their originals efficiently (understanding the size of my actual dataset)?
Working with dictionaries in columns is really complicated; here is one possible solution:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# mask for rows that duplicate an earlier row
m = sub_df.duplicated()
# create (id, label) tuples, aggregate each duplicate group to a dict
s = (df.assign(a = df[['id','label']].apply(tuple, axis=1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))
# add new column
df = df.join(s.rename('duplicates'), on=cols)
# replace missing values and non-first duplicates with empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicates
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
An alternative uses a custom function to assign all duplicates except the first to the new column of each group's first row; the last line uses a different mask to replace the remaining values with empty strings:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
m = ~df.duplicated(subset=cols) & df.duplicated(subset=cols, keep=False)
def f(x):
    # store the {id: label} pairs of all but the first row on the group's first row
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id','label']].to_numpy()[1:])]
    return x
df = df.groupby(cols).apply(f)
df['duplicated'] = df['duplicated'].where(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicated
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
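Another route, sketched on the question's sample data, tags every row with the id of the first row in its duplicate group via transform('first') (which also shows how to swap col1 for a list of columns, per the linked post), then aggregates the {id: label} pairs back onto that first row:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [f'id_{i}' for i in range(100, 111)],
    'ft1': [1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    'ft2': [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
    'ft3': [43, 33, 12, 46, 10, 99, 0, 6, 29, 27, 32],
    'ft4': [1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    'ft5': [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    'label': ['High', 'Medium', 'Low', 'Low', 'High', 'Low',
              'High', 'High', 'Medium', 'Medium', 'High'],
})
cols = ['ft1', 'ft2', 'ft4', 'ft5']

# id of the first occurrence within each group of identical key columns
first_id = df.groupby(cols)['id'].transform('first')
# rows that duplicate an earlier row
m = df.duplicated(subset=cols)
# collect the {id: label} pairs of the duplicates under their first row's id
dup_map = (df.loc[m, ['id', 'label']]
             .set_index(first_id[m])
             .groupby(level=0)
             .apply(lambda g: dict(zip(g['id'], g['label']))))
df['duplicates'] = df['id'].map(dup_map).fillna('')
```

This does one groupby over the key columns plus one map, which should scale reasonably to the real (438796-row) frame.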
I have a SELECT that returns a table which has:
- 5 possible values for region (from 1 to 5), and
- 3 possible values for age (1-3), with 2 possible values (1 or 2) for gender in each age group.
So table 1. looks something like this:
+----------+-----------+--------------+---------------+---------+
| att_name | att_value | sub_att_name | sub_att_value | percent |
+----------+-----------+--------------+---------------+---------+
| region | 1 | NULL | 0 | 34 |
| region | 2 | NULL | 0 | 22 |
| region | 3 | NULL | 0 | 15 |
| region | 4 | NULL | 0 | 37 |
| region | 5 | NULL | 0 | 12 |
| age | 1 | gender | 1 | 28 |
| age | 1 | gender | 2 | 8 |
| age | 2 | gender | 1 | 13 |
| age | 2 | gender | 2 | 45 |
| age | 3 | gender | 1 | 34 |
| age | 3 | gender | 2 | 34 |
+----------+-----------+--------------+---------------+---------+
The second table holds records with values from table 1, where table 1's unique values for att_name and sub_att_name are table 2's attributes:
+--------+-----+-----+
| region | age | gen |
+--------+-----+-----+
| 2 | 2 | 1 |
| 3 | 1 | 2 |
| 3 | 3 | 2 |
| 1 | 3 | 1 |
| 4 | 2 | 2 |
| 5 | 2 | 1 |
+--------+-----+-----+
I want to return the count of each unique value for the region and age/gender attributes from the second table.
Final result should look like this:
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
| att_name | att_value | att_value_count | sub_att_name | sub_att_value | sub_att_value_count | percent |
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
| region | 1 | 1 | NULL | 0 | NULL | 34 |
| region | 2 | 1 | NULL | 0 | NULL | 22 |
| region | 3 | 2 | NULL | 0 | NULL | 15 |
| region | 4 | 1 | NULL | 0 | NULL | 37 |
| region | 5 | 1 | NULL | 0 | NULL | 12 |
| age | 1 | NULL | gender | 1 | 0 | 28 |
| age | 1 | NULL | gender | 2 | 1 | 8 |
| age | 2 | NULL | gender | 1 | 2 | 13 |
| age | 2 | NULL | gender | 2 | 1 | 45 |
| age | 3 | NULL | gender | 1 | 1 | 34 |
| age | 3 | NULL | gender | 2 | 1 | 34 |
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
Explanation
Region - doesn't have a sub attribute, so sub_att_name and sub_att_value_count are NULL.
att_value_count - counts appearances of each unique region (1 for every region except region 3, which shows up 2 times).
Age/gender - counts combinations of age and gender appearances (the groups are 1/1, 1/2, 2/1, 2/2 and 3/1, 3/2).
Since we only fill in values for the combinations, att_value_count is NULL on those rows.
I'm tagging python and pandas in this question since I don't know if this is possible in SQL at all... I hope it is, since we use analytical tools to pull tables and views from the database.
EDIT
SQL - the answers look complicated; I'll test them tomorrow and see if they work.
Python - seems more appealing now - is there a way to parse att_name and sub_att_name, find level-1 and level-2 attributes, and act accordingly? I think this is only possible in Python, and we do have different attributes and attribute levels.
I'm already thankful for the given answers!
I think this is good enough to solve the issue:
import numpy as np
import pandas as pd

data_1 = {'att_name': ['region','region','region','region','region','age','age','age','age','age','age'],
          'att_value': [1,2,3,4,5,1,1,2,2,3,3],
          'sub_att_name': [np.nan,np.nan,np.nan,np.nan,np.nan,'gender','gender','gender','gender','gender','gender'],
          'sub_att_value': [0,0,0,0,0,1,2,1,2,1,2],
          'percent': [34,22,15,37,12,28,8,13,45,34,34]}
df_1 = pd.DataFrame(data_1)
data_2 = {'region': [2,3,3,1,4,5], 'age': [2,1,3,3,2,2], 'gen': [1,2,2,1,2,1]}
df_2 = pd.DataFrame(data_2)
# count how often each age/gen combination appears in df_2
df_2_grouped = (df_2.groupby(['age','gen'], as_index=False)
                    .agg({'region': 'count'})
                    .rename(columns={'region': 'counts'}))
# attach the counts to the matching age/gender rows of df_1
df_final = (df_1.merge(df_2_grouped, how='left',
                       left_on=['att_value','sub_att_value'],
                       right_on=['age','gen'])
                .drop(columns=['age','gen'])
                .rename(columns={'counts': 'sub_att_value_count'}))
Output of df_final:
att_name att_value sub_att_name sub_att_value percent sub_att_value_count
0 region 1 NaN 0 34 NaN
1 region 2 NaN 0 22 NaN
2 region 3 NaN 0 15 NaN
3 region 4 NaN 0 37 NaN
4 region 5 NaN 0 12 NaN
5 age 1 gender 1 28 NaN
6 age 1 gender 2 8 1.0
7 age 2 gender 1 13 2.0
8 age 2 gender 2 45 1.0
9 age 3 gender 1 34 1.0
10 age 3 gender 2 34 1.0
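The merge above fills only the sub-attribute counts; the att_value_count column from the desired output can be added the same way with value_counts and map. A sketch (redefining just the columns it needs so it is self-contained):

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({
    'att_name': ['region'] * 5 + ['age'] * 6,
    'att_value': [1, 2, 3, 4, 5, 1, 1, 2, 2, 3, 3],
})
df_2 = pd.DataFrame({'region': [2, 3, 3, 1, 4, 5],
                     'age': [2, 1, 3, 3, 2, 2],
                     'gen': [1, 2, 2, 1, 2, 1]})

# how often each region appears in df_2
region_counts = df_2['region'].value_counts()
is_region = df_1['att_name'].eq('region')
# fill counts on region rows only; age/gender rows stay NULL
df_1['att_value_count'] = np.where(is_region,
                                   df_1['att_value'].map(region_counts),
                                   np.nan)
```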
This is a pandas solution, basically, lookup or map.
# step 1: counts of each region value
df['att_value_count'] = np.nan
s = df['att_name'].eq('region')
df.loc[s, 'att_value_count'] = df.loc[s, 'att_value'].map(df2['region'].value_counts())
# step 2: counts of each age/gen combination
counts = df2.groupby('age')['gen'].value_counts().unstack('gen', fill_value=0)
df['sub_att_value_count'] = np.nan
tmp = df.loc[~s, ['att_value', 'sub_att_value']]
# DataFrame.lookup was removed in pandas 2.0; index the array positionally instead
df.loc[~s, 'sub_att_value_count'] = counts.to_numpy()[
    counts.index.get_indexer(tmp['att_value']),
    counts.columns.get_indexer(tmp['sub_att_value'])]
You can also use merge, which is more SQL-friendly. For example, in step 2:
counts = df2.groupby('age')['gen'].value_counts().reset_index(name='sub_att_value_count')
(df.merge(counts,
left_on=['att_value','sub_att_value'],
right_on=['age','gen'],
how = 'outer'
)
.drop(['age','gen'], axis=1)
)
Output:
att_name att_value sub_att_name sub_att_value percent att_value_count sub_att_value_count
-- ---------- ----------- -------------- --------------- --------- ----------------- ---------------------
0 region 1 nan 0 34 1 nan
1 region 2 nan 0 22 1 nan
2 region 3 nan 0 15 2 nan
3 region 4 nan 0 37 1 nan
4 region 5 nan 0 12 1 nan
5 age 1 gender 1 28 nan 0
6 age 1 gender 2 8 nan 1
7 age 2 gender 1 13 nan 2
8 age 2 gender 2 45 nan 1
9 age 3 gender 1 34 nan 1
10 age 3 gender 2 34 nan 1
Update: excuse my SQL skills if this doesn't run (it should, though):
select
    b.*,
    c.sub_att_value_count
from
    (select
        df1.*,
        a.att_value_count
     from
        (select
            region, count(*) as att_value_count
         from df2
         group by region
        ) as a
        full outer join df1
            on df1.att_value = a.region
    ) as b
    full outer join
    (select
        age, gen, count(*) as sub_att_value_count
     from df2
     group by age, gen
    ) as c
        on b.att_value = c.age and b.sub_att_value = c.gen
I have two Spark DataFrames, with values that I would like to add, and then multiply, and keep the lowest pair of values only. I have written a function that will do this:
def math_func(aValOne, aValTwo, bValOne, bValTwo):
    tmpOne = aValOne + bValOne
    tmpTwo = aValTwo + bValTwo
    final = tmpOne * tmpTwo
    return final
I would like to iterate through two Spark DataFrames, "A" and "B", row by row, and keep the lowest values results. So if I have two DataFrames:
DataFrameA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DataFrameB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
I would like to first take row 0 from DataFrameA, compare it to rows 0 and 1 of DataFrameB, and then keep the lowest value results. I have tried this:
results = DataFrameA.select('ID')(lambda i: DataFrameA.select('ID')(math_func(DataFrameA.ValOne, DataFrameA.ValTwo, DataFrameB.ValOne, DataFrameB.ValOne))
but I get errors about iterating through a DataFrame column. I know that in Pandas I would essentially make a nested "for loop", and then just write the results to another DataFrame and append the results. The results I would expect are:
Initial Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
0 | 117 | 1
1 | 77 | 0
1 | 150 | 1
Final Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
1 | 77 | 0
I am quite new at Spark, but I know enough to know I'm not approaching this the right way.
Any thoughts on how to go about this?
You will need multiple steps to achieve this.
Suppose you have data
DFA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DFB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
Step 1.
Do a cartesian join on your 2 dataframes. That will give you:
Cartesian:
DFA.ID | DFA.ValOne | DFA.ValTwo | DFB.ID | DFB.ValOne | DFB.ValTwo
0 | 2 | 4 | 0 | 4 | 5
1 | 3 | 6 | 0 | 4 | 5
0 | 2 | 4 | 1 | 7 | 9
1 | 3 | 6 | 1 | 7 | 9
Step 2.
Apply math_func as a column expression, Value = (DFA.ValOne + DFB.ValOne) * (DFA.ValTwo + DFB.ValTwo):
Computed:
DFA.ID | DFB.ID | Value
0 | 0 | 54
1 | 0 | 77
0 | 1 | 117
1 | 1 | 150
Step 3.
Group by DFA.ID and keep the row with the minimum Value, carrying DFB.ID along.
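Assuming pandas >= 1.2 (for merge(how='cross')), the cross-join-then-min pipeline can be prototyped in pandas to check the numbers against the question's expected results before translating it to Spark's crossJoin/groupBy. A sketch:

```python
import pandas as pd

dfa = pd.DataFrame({'ID': [0, 1], 'ValOne': [2, 3], 'ValTwo': [4, 6]})
dfb = pd.DataFrame({'ID': [0, 1], 'ValOne': [4, 7], 'ValTwo': [5, 9]})

# Step 1: cartesian join
crossed = dfa.merge(dfb, how='cross', suffixes=('_a', '_b'))
# Step 2: math_func's arithmetic as a column expression
crossed['Value'] = ((crossed['ValOne_a'] + crossed['ValOne_b'])
                    * (crossed['ValTwo_a'] + crossed['ValTwo_b']))
# Step 3: keep the row with the smallest Value per DataFrameA ID
final = (crossed.loc[crossed.groupby('ID_a')['Value'].idxmin(),
                     ['ID_a', 'Value', 'ID_b']]
                .rename(columns={'ID_a': 'DataFrameA_ID',
                                 'ID_b': 'DataFrameB_ID'}))
```

In PySpark the same shape is crossJoin, a withColumn for the Value expression, and a window or groupBy to keep the minimum per ID; the pandas version just makes the intermediate numbers easy to verify.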
I have a pandas DataFrame which looks like this:
| Id | Filter 1 | Filter 2 | Filter 3 |
|----|----------|----------|----------|
| 25 | 0 | 1 | 1 |
| 25 | 1 | 0 | 1 |
| 25 | 0 | 0 | 1 |
| 30 | 1 | 0 | 1 |
| 31 | 1 | 0 | 1 |
| 31 | 0 | 1 | 0 |
| 31 | 0 | 0 | 1 |
I need to transpose this table, add a "Name" column with the name of the filter, and sum the filter column values. The result table should look like this:
| Id | Name | Summ |
| 25 | Filter 1 | 1 |
| 25 | Filter 2 | 1 |
| 25 | Filter 3 | 3 |
| 30 | Filter 1 | 1 |
| 30 | Filter 2 | 0 |
| 30 | Filter 3 | 1 |
| 31 | Filter 1 | 1 |
| 31 | Filter 2 | 1 |
| 31 | Filter 3 | 2 |
The only solution I have come up with so far was to use an apply function on the data grouped by the Id column, but this method is too slow for my case - the dataset can have more than 40 columns and 50_000 rows. How can I do this with pandas-native methods (e.g. pivot, transpose, groupby)?
Use:
df_new = (df.melt('Id', var_name='Name', value_name='Sum')
            .groupby(['Id','Name']).Sum.sum()
            .reset_index())
print(df_new)
Id Name Sum
0 25 Filter 1 1
1 25 Filter 2 1
2 25 Filter 3 3
3 30 Filter 1 1
4 30 Filter 2 0
5 30 Filter 3 1
6 31 Filter 1 1
7 31 Filter 2 1
8 31 Filter 3 2
stack, then groupby:
df.set_index('Id').stack().groupby(level=[0,1]).sum().reset_index()
Id level_1 0
0 25 Filter 1 1
1 25 Filter 2 1
2 25 Filter 3 3
3 30 Filter 1 1
4 30 Filter 2 0
5 30 Filter 3 1
6 31 Filter 1 1
7 31 Filter 2 1
8 31 Filter 3 2
Short version (sum(level=0) was removed in pandas 2.0, so prefer the groupby spelling):
df.groupby('Id').sum().stack()
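To end up with exactly the Id / Name / Summ columns the question asks for, the groupby form can be finished with rename_axis and rename. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [25, 25, 25, 30, 31, 31, 31],
    'Filter 1': [0, 1, 0, 1, 1, 0, 0],
    'Filter 2': [1, 0, 0, 0, 0, 1, 0],
    'Filter 3': [1, 1, 1, 1, 1, 0, 1],
})

out = (df.groupby('Id').sum()        # one row per Id, filter columns summed
         .stack()                    # long format: (Id, filter name) -> sum
         .rename('Summ')
         .rename_axis(['Id', 'Name'])
         .reset_index())
```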
Using filter and melt
df.filter(like='Filter').groupby(df.Id).sum().T.reset_index().melt(id_vars='index')
index Id value
0 Filter 1 25 1
1 Filter 2 25 1
2 Filter 3 25 3
3 Filter 1 30 1
4 Filter 2 30 0
5 Filter 3 30 1
6 Filter 1 31 1
7 Filter 2 31 1
8 Filter 3 31 2