Python Postgres Set Column to be Percentile

I have a table of transactions and would like to add a percentile column that specifies the percentile of that transaction in that month based on the amount column.
Take this smaller example with quartiles instead of percentiles:
Example Input:
id | month | amount
1 | 1 | 1
2 | 1 | 2
3 | 1 | 5
4 | 1 | 3
5 | 2 | 1
6 | 2 | 3
1 | 2 | 5
1 | 2 | 7
1 | 2 | 9
1 | 2 | 11
1 | 2 | 15
1 | 2 | 16
Example Output
id | month | amount | quartile
1 | 1 | 1 | 25
2 | 1 | 2 | 50
3 | 1 | 5 | 100
4 | 1 | 3 | 75
5 | 2 | 1 | 25
6 | 2 | 3 | 25
1 | 2 | 5 | 50
1 | 2 | 15 | 100
1 | 2 | 9 | 75
1 | 2 | 11 | 75
1 | 2 | 7 | 50
1 | 2 | 16 | 100
Currently, I use Postgres's percentile_cont function to determine the amount values at the cutoff points for the different percentiles, and then go through and update the percentile column accordingly. Unfortunately, this approach is way too slow because I have many different months. Any ideas for how to do this more quickly, preferably combining the calculation of the percentile and the update into one SQL statement?
My code:
num_buckets = 10
for i in range(num_buckets):
    decimal_percentile = (i+1)*(1.0/num_buckets)
    prev_decimal_percentile = i*1.0/num_buckets
    percentile = int(decimal_percentile*100)
    cursor.execute("""SELECT month,
                          percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC),
                          percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC)
                      FROM transactions GROUP BY month;""",
                   (prev_decimal_percentile, decimal_percentile))
    iter_cursor = connection.cursor()
    for data in cursor:
        iter_cursor.execute("""UPDATE transactions SET percentile=%s
                               WHERE month = %s
                               AND amount >= %s AND amount <= %s;""",
                            (percentile, data[0], data[1], data[2]))

You can do this in a single query, example for 4 buckets:
update transactions t
set percentile = calc_percentile
from (
    select distinct on (month, amount)
        id,
        month,
        amount,
        calc_percentile
    from transactions
    join (
        select
            bucket,
            month as calc_month,
            percentile_cont(bucket*1.0/4) within group (order by amount asc) as calc_amount,
            bucket*100/4 as calc_percentile
        from transactions
        cross join generate_series(1, 4) bucket
        group by month, bucket
    ) s on month = calc_month and amount <= calc_amount
    order by month, amount, calc_percentile
) s
where t.month = s.month and t.amount = s.amount;
Results:
select *
from transactions
order by month, amount;
id | month | amount | percentile
----+-------+--------+------------
1 | 1 | 1 | 25
2 | 1 | 2 | 50
4 | 1 | 3 | 75
3 | 1 | 5 | 100
5 | 2 | 1 | 25
6 | 2 | 3 | 25
1 | 2 | 5 | 50
1 | 2 | 7 | 50
1 | 2 | 9 | 75
1 | 2 | 11 | 75
1 | 2 | 15 | 100
1 | 2 | 16 | 100
(12 rows)
Btw, id should be a primary key, then it could be used in joins for better performance.
DbFiddle.
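If you drive this from Python, a minimal psycopg2 sketch could look like the following; the connection string and bucket count are placeholders, and the statement is just the query above with the bucket count passed in as a parameter:

import psycopg2

num_buckets = 10  # placeholder; any bucket count works the same way

conn = psycopg2.connect("dbname=mydb")  # connection details are placeholders
with conn, conn.cursor() as cur:
    cur.execute("""
        UPDATE transactions t
        SET percentile = calc_percentile
        FROM (
            SELECT DISTINCT ON (month, amount)
                id,
                month,
                amount,
                calc_percentile
            FROM transactions
            JOIN (
                SELECT
                    month AS calc_month,
                    percentile_cont(bucket*1.0/%(n)s) WITHIN GROUP (ORDER BY amount ASC) AS calc_amount,
                    bucket*100/%(n)s AS calc_percentile
                FROM transactions
                CROSS JOIN generate_series(1, %(n)s) bucket
                GROUP BY month, bucket
            ) c ON month = calc_month AND amount <= calc_amount
            ORDER BY month, amount, calc_percentile
        ) s
        WHERE t.month = s.month AND t.amount = s.amount;
    """, {'n': num_buckets})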

How to build sequence of purchases for each ID?

I want to create a dataframe that shows me the sequence of what users are purchasing, according to the sequence column. For example, this is my current df:
user_id | sequence | product | price
1 | 1 | A | 10
1 | 2 | C | 15
1 | 3 | G | 1
2 | 1 | B | 20
2 | 2 | T | 45
2 | 3 | A | 10
...
I want to convert it to the following format:
user_id | source_product | target_product | cum_total_price
1 | A | C | 25
1 | C | G | 16
2 | B | T | 65
2 | T | A | 75
...
How can I achieve this?
shift + cumsum + groupby.apply:
def seq(g):
    g['source_product'] = g['product']
    g['target_product'] = g['product'].shift(-1)
    g['price'] = g.price.cumsum().shift(-1)
    return g[['user_id', 'source_product', 'target_product', 'price']].iloc[:-1]

df.sort_values('sequence').groupby('user_id', group_keys=False).apply(seq)
# user_id source_product target_product price
#0 1 A C 25.0
#1 1 C G 26.0
#3 2 B T 65.0
#4 2 T A 75.0
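groupby.apply can be slow on large frames; a fully vectorized sketch of the same idea (assuming the original user_id, sequence, product and price column names) would be:

# sort so shift/cumsum follow the purchase order within each user
out = df.sort_values(['user_id', 'sequence']).copy()
out['source_product'] = out['product']
# the next product bought by the same user
out['target_product'] = out.groupby('user_id')['product'].shift(-1)
# running total of price, aligned to the row of the next purchase
out['cum_total_price'] = out.groupby('user_id')['price'].cumsum()
out['cum_total_price'] = out.groupby('user_id')['cum_total_price'].shift(-1)
# the last purchase of each user has no target, so drop it
out = (out.dropna(subset=['target_product'])
          [['user_id', 'source_product', 'target_product', 'cum_total_price']])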

Mapping duplicate rows to originals with dictionary - Python 3.6

I am trying to locate duplicate rows in my pandas dataframe. In reality, df.shape is (438796, 4531), but I am using the toy example below as an MRE.
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low |
| id_104 | 1 | 1 | 10 | 1 | 1 | High |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low |
| id_106 | 0 | 0 | 0 | 0 | 0 | High |
| id_107 | 1 | 1 | 6 | 0 | 1 | High |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium |
| id_110 | 0 | 1 | 32 | 0 | 1 | High |
What I am trying to accomplish is to look at a subset of the features and, if there are duplicate rows, keep the first one and record which id: label pairs are its duplicates.
I have looked at the following posts:
find duplicate rows in a pandas dataframe
(I could not figure out how to replace col1 in df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin') with my list of cols)
Find all duplicate rows in a pandas dataframe
I know pandas has a duplicated() call. So I tried implementing that and it sort of works:
import pandas as pd
# Read in example data
df = pd.read_clipboard()
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()
# Loop through the duplicates and print out the values I want
for idx in dupes:
    # print(df[:idx])
    print(df.loc[[idx], ['id', 'label']])
However, what I am trying to do is for a particular row, determine which rows are duplicates of it by saving those rows as id: label combination. So while I'm able to extract the id and label for each duplicate, I have no ability to map it back to the original row for which it is a duplicate.
An ideal dataset would look like:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label | duplicates |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium | {id_107: High} |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low | |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low | |
| id_104 | 1 | 1 | 10 | 1 | 1 | High | |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low | {id_110: High} |
| id_106 | 0 | 0 | 0 | 0 | 0 | High | |
| id_107 | 1 | 1 | 6 | 0 | 1 | High | |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium | |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium | |
| id_110 | 0 | 1 | 32 | 0 | 1 | High | |
How can I take my duplicated values and map them back to their originals efficiently (understanding the size of my actual dataset)?
Working with dictionaries in columns is really complicated; here is one possible solution:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# mask for duplicates (all but the first occurrence)
m = sub_df.duplicated()
# create (id, label) tuples, aggregate them to a dict per duplicated group
s = (df.assign(a = df[['id','label']].apply(tuple, 1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))
# add new column
df = df.join(s.rename('duplicates'), on=cols)
# replace missing values and non-first duplicates with empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicates
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
An alternative with a custom function: it assigns all duplicates except the first to the first row of each group in a new column, and the last step uses a changed mask to replace the remaining values with empty strings:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
m = ~df.duplicated(subset=cols) & df.duplicated(subset=cols, keep=False)

def f(x):
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id','label']].to_numpy()[1:])]
    return x

df = df.groupby(cols).apply(f)
df['duplicated'] = df['duplicated'].where(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicated
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
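Given the size of the real dataset, one more option (a sketch, not part of either answer above) is to skip storing dicts inside the DataFrame and keep the mapping as an ordinary Python dict keyed by the first id of each duplicate group:

cols = ['ft1', 'ft2', 'ft4', 'ft5']

dup_map = {}
for _, g in df.groupby(cols, sort=False):
    if len(g) > 1:
        # the first row of each group is the "original"; the rest are its duplicates
        dup_map[g['id'].iloc[0]] = dict(zip(g['id'].iloc[1:], g['label'].iloc[1:]))

# dup_map would then look like:
# {'id_100': {'id_102': 'Low', 'id_104': 'High', 'id_108': 'Medium'},
#  'id_101': {'id_107': 'High'},
#  'id_105': {'id_110': 'High'}}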

SQL / Python - how to return count for each attribute and sub-attribute from another table

I have a SELECT that returns a table which has:
- 5 possible values for region (from 1 to 5), and
- 3 possible values for age (1-3), with 2 possible values (1 or 2) for gender for each age group.
So table 1 looks something like this:
+----------+-----------+--------------+---------------+---------+
| att_name | att_value | sub_att_name | sub_att_value | percent |
+----------+-----------+--------------+---------------+---------+
| region | 1 | NULL | 0 | 34 |
| region | 2 | NULL | 0 | 22 |
| region | 3 | NULL | 0 | 15 |
| region | 4 | NULL | 0 | 37 |
| region | 5 | NULL | 0 | 12 |
| age | 1 | gender | 1 | 28 |
| age | 1 | gender | 2 | 8 |
| age | 2 | gender | 1 | 13 |
| age | 2 | gender | 2 | 45 |
| age | 3 | gender | 1 | 34 |
| age | 3 | gender | 2 | 34 |
+----------+-----------+--------------+---------------+---------+
The second table holds records with values from table 1, where table 1's unique values for att_name and sub_att_name are table 2's attributes:
+--------+-----+-----+
| region | age | gen |
+--------+-----+-----+
| 2 | 2 | 1 |
| 3 | 1 | 2 |
| 3 | 3 | 2 |
| 1 | 3 | 1 |
| 4 | 2 | 2 |
| 5 | 2 | 1 |
+--------+-----+-----+
I want to return the count of each unique value for the region and age/gender attributes from the second table.
Final result should look like this:
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
| att_name | att_value | att_value_count | sub_att_name | sub_att_value | sub_att_value_count | percent |
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
| region | 1 | 1 | NULL | 0 | NULL | 34 |
| region | 2 | 1 | NULL | 0 | NULL | 22 |
| region | 3 | 2 | NULL | 0 | NULL | 15 |
| region | 4 | 1 | NULL | 0 | NULL | 37 |
| region | 5 | 1 | NULL | 0 | NULL | 12 |
| age | 1 | NULL | gender | 1 | 0 | 28 |
| age | 1 | NULL | gender | 2 | 1 | 8 |
| age | 2 | NULL | gender | 1 | 2 | 13 |
| age | 2 | NULL | gender | 2 | 1 | 45 |
| age | 3 | NULL | gender | 1 | 1 | 34 |
| age | 3 | NULL | gender | 2 | 1 | 34 |
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
Explanation
Region - doesn't have a sub attribute, so sub_att_name and sub_att_value_count are NULL.
att_value_count - counts the appearances of each unique region (1 for all regions except region 3, which shows up 2 times).
Age/gender - counts combinations of age and gender (the groups are 1/1, 1/2, 2/1, 2/2 and 3/1, 3/2).
Since we only need to fill in values for the combinations, att_value_count is NULL.
I'm tagging python and pandas in this question since I don't know if this is possible in SQL at all... I hope it is, since we are using analytical tools to pull tables and views from the database more naturally.
EDIT
SQL - the answers look complicated; I'll test and see if they work tomorrow.
Python - seems more appealing now - is there a way to parse att_name and sub_att_name, find level-1 and level-2 attributes, and act accordingly? I think this is only possible with Python, and we do have different attributes and attribute levels.
I'm already thankful for the answers given!
I think this is good enough to solve the issue:
import numpy as np
import pandas as pd

data_1 = {'att_name':['region','region','region','region','region','age','age','age','age','age','age'],'att_value':[1,2,3,4,5,1,1,2,2,3,3],'sub_att_name':[np.nan,np.nan,np.nan,np.nan,np.nan,'gender','gender','gender','gender','gender','gender'],'sub_att_value':[0,0,0,0,0,1,2,1,2,1,2],'percent':[34,22,15,37,12,28,8,13,45,34,34]}
df_1 = pd.DataFrame(data_1)
data_2 = {'region':[2,3,3,1,4,5],'age':[2,1,3,3,2,2],'gen':[1,2,2,1,2,1]}
df_2 = pd.DataFrame(data_2)

df_2_grouped = df_2.groupby(['age','gen'],as_index=False).agg({'region':'count'}).rename(columns={'region':'counts'})
df_final = df_1.merge(df_2_grouped,how='left',left_on=['att_value','sub_att_value'],right_on=['age','gen']).drop(columns=['age','gen']).rename(columns={'counts':'sub_att_value_count'})
Output of df_final:
att_name att_value sub_att_name sub_att_value percent sub_att_value_count
0 region 1 NaN 0 34 NaN
1 region 2 NaN 0 22 NaN
2 region 3 NaN 0 15 NaN
3 region 4 NaN 0 37 NaN
4 region 5 NaN 0 12 NaN
5 age 1 gender 1 28 NaN
6 age 1 gender 2 8 1.0
7 age 2 gender 1 13 2.0
8 age 2 gender 2 45 1.0
9 age 3 gender 1 34 1.0
10 age 3 gender 2 34 1.0
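The code above fills only sub_att_value_count; the region counts could be added in the same spirit. A hedged follow-up, reusing the df_2 and df_final names from the snippet above:

# count how often each region appears in table 2
region_counts = df_2['region'].value_counts()
is_region = df_final['att_name'].eq('region')
# map the counts onto the region rows; age rows stay NaN
df_final.loc[is_region, 'att_value_count'] = df_final.loc[is_region, 'att_value'].map(region_counts)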
This is a pandas solution, basically, lookup or map.
df['att_value_count'] = np.nan
s = df['att_name'].eq('region')
df.loc[s, 'att_value_count'] = df.loc[s,'att_value'].map(df2['region'].value_counts())
# step 2
counts = df2.groupby('age')['gen'].value_counts().unstack('gen', fill_value=0)
df['sub_att_value_count'] = np.nan
tmp = df.loc[~s, ['att_value','sub_att_value']]
df.loc[~s, 'sub_att_value_count'] = counts.lookup(tmp['att_value'], tmp['sub_att_value'])
You can also use merge, which is more SQL-friendly. For example, in step 2:
counts = df2.groupby('age')['gen'].value_counts().reset_index(name='sub_att_value_count')
(df.merge(counts,
left_on=['att_value','sub_att_value'],
right_on=['age','gen'],
how = 'outer'
)
.drop(['age','gen'], axis=1)
)
Output:
att_name att_value sub_att_name sub_att_value percent att_value_count sub_att_value_count
-- ---------- ----------- -------------- --------------- --------- ----------------- ---------------------
0 region 1 nan 0 34 1 nan
1 region 2 nan 0 22 1 nan
2 region 3 nan 0 15 2 nan
3 region 4 nan 0 37 1 nan
4 region 5 nan 0 12 1 nan
5 age 1 gender 1 28 nan 0
6 age 1 gender 2 8 nan 1
7 age 2 gender 1 13 nan 2
8 age 2 gender 2 45 nan 1
9 age 3 gender 1 34 nan 1
10 age 3 gender 2 34 nan 1
Update: Excuse my SQL skill if this doesn't run (it should though)
select
    b.*,
    c.sub_att_value_count
from
    (select
        df1.*,
        a.att_value_count
     from df1
     left join
        (select
            region, count(*) as att_value_count
         from df2
         group by region
        ) as a
        on df1.att_value = a.region and df1.att_name = 'region'
    ) as b
left join
    (select
        age, gen, count(*) as sub_att_value_count
     from df2
     group by age, gen
    ) as c
    on b.att_value = c.age and b.sub_att_value = c.gen

Add values in two Spark DataFrames, row by row

I have two Spark DataFrames, with values that I would like to add, and then multiply, and keep the lowest pair of values only. I have written a function that will do this:
def math_func(aValOne, aValTwo, bValOne, bValTwo):
    tmpOne = aValOne + bValOne
    tmpTwo = aValTwo + bValTwo
    final = tmpOne*tmpTwo
    return final
I would like to iterate through two Spark DataFrames, "A" and "B", row by row, and keep the lowest values results. So if I have two DataFrames:
DataFrameA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DataFrameB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
I would like to first take row 0 from DataFrameA:, compare it to rows 0 and 1 of DataFrameB, and then keep the lowest value results. I have tried this:
results = DataFrameA.select('ID')(lambda i: DataFrameA.select('ID')(math_func(DataFrameA.ValOne, DataFrameA.ValTwo, DataFrameB.ValOne, DataFrameB.ValOne))
but I get errors about iterating through a DataFrame column. I know that in Pandas I would essentially make a nested "for loop", and then just write the results to another DataFrame and append the results. The results I would expect are:
Initial Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
0 | 117 | 1
1 | 77 | 0
1 | 150 | 1
Final Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
1 | 77 | 0
I am quite new at Spark, but I know enough to know I'm not approaching this the right way.
Any thoughts on how to go about this?
You will need multiple steps to achieve this.
Suppose you have data
DFA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DFB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
Step 1.
Do a cartesian join on your 2 dataframes. That will give you:
Cartesian:
DFA.ID | DFA.ValOne | DFA.ValTwo | DFB.ID | DFB.ValOne | DFB.ValTwo
0 | 2 | 4 | 0 | 4 | 5
1 | 3 | 6 | 0 | 4 | 5
0 | 2 | 4 | 1 | 7 | 9
1 | 3 | 6 | 1 | 7 | 9
Step 2.
Multiply columns:
Multiplied:
DFA.ID | DFA.Mul | DFB.ID | DFB.Mul
0 | 8 | 0 | 20
1 | 18 | 0 | 20
0 | 8 | 1 | 63
1 | 18 | 1 | 63
Step 3.
Group by DFA.ID and select min from DFA.Mul and DFB.Mul
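A minimal PySpark sketch of that pipeline, applying the question's math_func after the cross join and keeping the row with the lowest value per DataFrameA ID (the DataFrame and column aliases here are illustrative assumptions):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

dfA = spark.createDataFrame([(0, 2, 4), (1, 3, 6)], ["ID", "ValOne", "ValTwo"])
dfB = spark.createDataFrame([(0, 4, 5), (1, 7, 9)], ["ID", "ValOne", "ValTwo"])

# Step 1: cartesian join, renaming columns so the two sides stay distinguishable
a = dfA.select(F.col("ID").alias("A_ID"),
               F.col("ValOne").alias("A_ValOne"),
               F.col("ValTwo").alias("A_ValTwo"))
b = dfB.select(F.col("ID").alias("B_ID"),
               F.col("ValOne").alias("B_ValOne"),
               F.col("ValTwo").alias("B_ValTwo"))
joined = a.crossJoin(b)

# Step 2: add the paired values, then multiply the two sums (the question's math_func)
scored = joined.withColumn(
    "Value",
    (F.col("A_ValOne") + F.col("B_ValOne")) * (F.col("A_ValTwo") + F.col("B_ValTwo")))

# Step 3: keep the lowest Value per A_ID, retaining the matching B_ID
w = Window.partitionBy("A_ID").orderBy("Value")
result = (scored.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .select(F.col("A_ID").alias("DataFrameA_ID"),
                        "Value",
                        F.col("B_ID").alias("DataFrameB_ID")))
result.show()
# Expected: (0, 54, 0) and (1, 77, 0), matching the question's final results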

Pandas: Transpose, groupby and summarize columns

I have a pandas DataFrame which looks like this:
| Id | Filter 1 | Filter 2 | Filter 3 |
|----|----------|----------|----------|
| 25 | 0 | 1 | 1 |
| 25 | 1 | 0 | 1 |
| 25 | 0 | 0 | 1 |
| 30 | 1 | 0 | 1 |
| 31 | 1 | 0 | 1 |
| 31 | 0 | 1 | 0 |
| 31 | 0 | 0 | 1 |
I need to transpose this table, add a "Name" column with the name of the filter, and sum the filter column values. The result table should look like this:
| Id | Name | Summ |
| 25 | Filter 1 | 1 |
| 25 | Filter 2 | 1 |
| 25 | Filter 3 | 3 |
| 30 | Filter 1 | 1 |
| 30 | Filter 2 | 0 |
| 30 | Filter 3 | 1 |
| 31 | Filter 1 | 1 |
| 31 | Filter 2 | 1 |
| 31 | Filter 3 | 2 |
The only solution I have come up with so far was to use an apply function on the data grouped by the Id column, but this method is too slow for my case - the dataset can have more than 40 columns and 50_000 rows. How can I do this with pandas native methods (e.g. pivot, transpose, groupby)?
Use:
df_new=df.melt('Id',var_name='Name',value_name='Sum').groupby(['Id','Name']).Sum.sum()\
.reset_index()
print(df_new)
Id Name Sum
0 25 Filter 1 1
1 25 Filter 2 1
2 25 Filter 3 3
3 30 Filter 1 1
4 30 Filter 2 0
5 30 Filter 3 1
6 31 Filter 1 1
7 31 Filter 2 1
8 31 Filter 3 2
stack then groupby
df.set_index('Id').stack().groupby(level=[0,1]).sum().reset_index()
Id level_1 0
0 25 Filter 1 1
1 25 Filter 2 1
2 25 Filter 3 3
3 30 Filter 1 1
4 30 Filter 2 0
5 30 Filter 3 1
6 31 Filter 1 1
7 31 Filter 2 1
8 31 Filter 3 2
Short version
df.set_index('Id').sum(level=0).stack()#df.groupby('Id').sum().stack()
Using filter and melt
df.filter(like='Filter').groupby(df.Id).sum().T.reset_index().melt(id_vars='index')
index Id value
0 Filter 1 25 1
1 Filter 2 25 1
2 Filter 3 25 3
3 Filter 1 30 1
4 Filter 2 30 0
5 Filter 3 30 1
6 Filter 1 31 1
7 Filter 2 31 1
8 Filter 3 31 2
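If the exact Id/Name/Summ column names from the question are wanted, a small variant of the groupby/stack idea (assuming Id and the Filter columns are the only columns in the frame) would be:

out = (df.groupby('Id').sum()      # sum each Filter column per Id
         .stack()                  # long format: (Id, filter name) -> sum
         .rename('Summ')
         .rename_axis(['Id', 'Name'])
         .reset_index())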
