SQL/Impala: Applying another group by on the first group by output - python

I need to do a group by on top of a group by output. For example, in the below table1:
id | timestamp | team
----------------------------
1 | 2016-01-02 | A
2 | 2016-02-01 | B
1 | 2016-02-04 | A
1 | 2016-03-05 | A
3 | 2016-05-12 | B
3 | 2016-05-15 | B
4 | 2016-07-07 | A
5 | 2016-08-01 | C
6 | 2015-08-01 | C
1 | 2015-04-01 | A
If I do a query:
query = 'select id, max(timestamp) as latest_ts from table1' + \
        ' where timestamp > "2016-01-01 00:00:00" group by id'
I would have:
id | latest_ts |
---------------------
2 | 2016-02-01 |
1 | 2016-03-05 |
3 | 2016-05-15 |
4 | 2016-07-07 |
5 | 2016-08-01 |
However, I am wondering if it is possible to include the team column like below as well?
id | latest_ts | team
----------------------------
2 | 2016-02-01 | B
1 | 2016-03-05 | A
3 | 2016-05-15 | B
4 | 2016-07-07 | A
5 | 2016-08-01 | C
Ultimately, what I really need is to know how many distinct ids there are in each team for the year 2016. My expected result should be:
team | count(id)
-------------------
A | 2
B | 2
C | 1
I am trying to do another group by on top of the first group by result using the code below, but I get syntax errors.
import pandas as pd
from impala.util import as_pandas  # converts the cursor's result set into a DataFrame

query = 'select team, count(id) from ' + \
        '(select id, max(timestamp) as latest_ts from table1' + \
        ' where timestamp > "2016-01-01 00:00:00" group by id)' + \
        'group by team'
cursor = impala_con.cursor()
cursor.execute('USE history')
cursor.execute(query)
df_result = as_pandas(cursor)
df_result
So I am wondering if this is something that can be achieved? If so, what would be the right way to do it? Thanks!
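A sketch of one possible fix (not tested against Impala, and using the same table1/impala_con objects from the question): keep team in the inner query and group by both id and team, which works here because each id maps to a single team in the sample data; give the derived table an alias; and make sure the concatenated string keeps a space before the outer group by.
from impala.util import as_pandas

# Sketch only: inner query keeps team, the derived table gets an alias ("t"),
# and every string piece keeps its surrounding spaces so the SQL stays valid.
query = ('select team, count(distinct id) as id_count from '
         '(select id, team, max(timestamp) as latest_ts from table1 '
         'where timestamp > "2016-01-01 00:00:00" group by id, team) t '
         'group by team')

cursor = impala_con.cursor()
cursor.execute('USE history')
cursor.execute(query)
df_result = as_pandas(cursor)
If only the final per-team counts are needed, the nested query can be dropped entirely: select team, count(distinct id) from table1 where timestamp > "2016-01-01 00:00:00" group by team gives the same result.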

Related

How to build sequence of purchases for each ID?

I want to create a dataframe that shows the sequence of what users purchase, according to the sequence column. For example, this is my current df:
user_id | sequence | product | price
1 | 1 | A | 10
1 | 2 | C | 15
1 | 3 | G | 1
2 | 1 | B | 20
2 | 2 | T | 45
2 | 3 | A | 10
...
I want to convert it to the following format:
user_id | source_product | target_product | cum_total_price
1 | A | C | 25
1 | C | G | 16
2 | B | T | 65
2 | T | A | 75
...
How can I achieve this?
shift + cumsum + groupby.apply:
def seq(g):
    g['source_product'] = g['product']
    g['target_product'] = g['product'].shift(-1)
    g['price'] = g.price.cumsum().shift(-1)
    return g[['user_id', 'source_product', 'target_product', 'price']].iloc[:-1]

df.sort_values('sequence').groupby('user_id', group_keys=False).apply(seq)
# user_id source_product target_product price
#0 1 A C 25.0
#1 1 C G 26.0
#3 2 B T 65.0
#4 2 T A 75.0
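For reference, a self-contained sketch that rebuilds the sample frame from the question and applies the same function; renaming the price column to cum_total_price at the end is my addition to match the requested output format.
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 2],
    'sequence': [1, 2, 3, 1, 2, 3],
    'product':  ['A', 'C', 'G', 'B', 'T', 'A'],
    'price':    [10, 15, 1, 20, 45, 10],
})

def seq(g):
    g['source_product'] = g['product']
    g['target_product'] = g['product'].shift(-1)   # next product in the sequence
    g['price'] = g.price.cumsum().shift(-1)        # cumulative price up to and including the target row
    return g[['user_id', 'source_product', 'target_product', 'price']].iloc[:-1]

out = (df.sort_values('sequence')
         .groupby('user_id', group_keys=False)
         .apply(seq)
         .rename(columns={'price': 'cum_total_price'}))
print(out)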

How to write a single query to run some retrospective aggregation when time window is different for every row?

I am writing some SQL queries to create a dataset for customer churn prediction based on historical service data. Some of the services date back years; a small percentage of them churned at some point in the past while others ended up getting renewed. Some of the attributes are aggregations over the services that were active at the same time as a given service. For example, I want to find out how many services under the same account were active when an individual service was active.
| account_no | service_no | InstallationDate | BillingStopDate | ExtractDate | Churn |
|------------|------------|------------------|-----------------|-------------|-------|
| A1 | S1 | 2018-01 | 2019-03 | 2019-02 | 1 |
| A2 | S2 | 2016-05 | 2020-04 | 2020-03 | 0 |
| A1 | S3 | 2018-07 | 2019-07 | 2019-06 | 0 |
| A1 | S4 | 2016-11 | 2021-03 | 2021-02 | 0 |
| A1 | S5 | 2018-01 | 2019-01 | 2018-12 | 1 |
| A2 | S6 | 2019-01 | 2021-08 | 2021-07 | 1 |
The total active services for S1 and S3 under the same account A1 are 4 and 2 respectively, because they have different ExtractDate values. I am currently running a SQL query through a Python wrapper, with account_no and ExtractDate as input arguments, and then looping through the whole dataset to run the query for each row:
from tqdm import tqdm

def stats_per_account(conn, account, till_date):
    query = """select count(distinct ServiceNo) from ServicesAll
               where AccountNo = '{0}' and
                     InstallationDate < '{1}' and
                     (BillingStopDate is NULL or BillingStopDate >= '{2}')""".format(account, till_date, till_date)
    with conn.cursor() as cursor:
        cursor.execute(query)
        row = cursor.fetchone()
    return row

sno = []
ano = []
svc = []
for _, row in tqdm(iter_df.iterrows(), total=iter_df.shape[0]):
    cnt = stats_per_account(conn, row['AccountNo'], row['ExtractDate'])
    sno.append(row['ServiceNo'])
    ano.append(row['AccountNo'])
    svc.append(cnt[0])
Since the queries are run one after another for each row in the dataset, this is very time consuming. I wonder if there is a more efficient way of doing this, maybe a single query for all the rows altogether?
Here's a rough example you might be able to follow. I use a WITH clause to provide the set of parameters to apply, but you could also store this in a base table to use in the JOIN.
If you have an algorithmic way to generate the sets of parameters, that could be used in a derived table or CTE term to provide as many sets as you wish. We could even use recursive behavior, if your database supports it.
A list of date ranges and list of accounts could easily be used to generate sets of parameters.
The data:
SELECT * FROM ServicesAll;
+------------+------------+------------------+-----------------+-------------+-------+
| account_no | service_no | InstallationDate | BillingStopDate | ExtractDate | Churn |
+------------+------------+------------------+-----------------+-------------+-------+
| A1 | S1 | 2018-01-01 | 2019-03-01 | 2019-02-01 | 1 |
| A2 | S2 | 2016-05-01 | 2020-04-01 | 2020-03-01 | 0 |
| A1 | S3 | 2018-07-01 | 2019-07-01 | 2019-06-01 | 0 |
| A1 | S4 | 2016-11-01 | 2021-03-01 | 2021-02-01 | 0 |
| A1 | S5 | 2018-01-01 | 2019-01-01 | 2018-12-01 | 1 |
| A2 | S6 | 2019-01-01 | 2021-08-01 | 2021-07-01 | 1 |
+------------+------------+------------------+-----------------+-------------+-------+
Here's an approach which generates lists of accounts and date ranges to derive the sets of parameters to apply, and then applies them all in the same query:
WITH RECURSIVE accounts (account_no) AS ( SELECT DISTINCT account_no FROM ServicesAll)
, startDate (start_date) AS ( SELECT current_date - INTERVAL '5' YEAR )
, ranges (start_date, end_date, n) AS (
SELECT start_date, start_date + INTERVAL '6' MONTH, 1 FROM startDate UNION ALL
SELECT start_date + INTERVAL '6' MONTH
, end_date + INTERVAL '6' MONTH, n+1 FROM ranges WHERE n < 8
)
, args (p0, p1, p2) AS (
SELECT account_no, start_date, end_date
FROM accounts, ranges
)
SELECT p0, p1, p2, COUNT(DISTINCT service_no)
FROM ServicesAll, args
WHERE account_no = args.p0
AND InstallationDate < args.p1
AND (BillingStopDate IS NULL OR BillingStopDate >= args.p2)
GROUP BY p0, p1, p2
;
The result:
+------+------------+------------+----------------------------+
| p0 | p1 | p2 | COUNT(DISTINCT service_no) |
+------+------------+------------+----------------------------+
| A1 | 2017-02-17 | 2017-08-17 | 1 |
| A1 | 2017-08-17 | 2018-02-17 | 1 |
| A1 | 2018-02-17 | 2018-08-17 | 3 |
| A1 | 2018-08-17 | 2019-02-17 | 3 |
| A1 | 2019-02-17 | 2019-08-17 | 1 |
| A1 | 2019-08-17 | 2020-02-17 | 1 |
| A1 | 2020-02-17 | 2020-08-17 | 1 |
| A2 | 2016-08-17 | 2017-02-17 | 1 |
| A2 | 2017-02-17 | 2017-08-17 | 1 |
| A2 | 2017-08-17 | 2018-02-17 | 1 |
| A2 | 2018-02-17 | 2018-08-17 | 1 |
| A2 | 2018-08-17 | 2019-02-17 | 1 |
| A2 | 2019-02-17 | 2019-08-17 | 2 |
| A2 | 2019-08-17 | 2020-02-17 | 2 |
| A2 | 2020-02-17 | 2020-08-17 | 1 |
+------+------------+------------+----------------------------+
The following SQL provides sets of parameters in the args CTE term, and then JOINs with that CTE term in the final query expression. The key is the GROUP BY clause, which produces a result row for each set of parameters.
WITH args (p0, p1, p2) AS (
SELECT 'A1', '2018-07-02', '2018-07-02' UNION
SELECT 'A1', '2016-11-02', '2016-11-02' UNION
SELECT 'A2', '2016-11-02', '2016-11-02'
)
SELECT p0, p1, p2, COUNT(DISTINCT service_no)
FROM ServicesAll, args
WHERE account_no = args.p0
AND InstallationDate < args.p1
AND (BillingStopDate IS NULL OR BillingStopDate >= args.p2)
GROUP BY p0, p1, p2
;
Result:
+----+------------+------------+----------------------------+
| p0 | p1 | p2 | COUNT(DISTINCT service_no) |
+----+------------+------------+----------------------------+
| A1 | 2016-11-02 | 2016-11-02 | 1 |
| A1 | 2018-07-02 | 2018-07-02 | 4 |
| A2 | 2016-11-02 | 2016-11-02 | 1 |
+----+------------+------------+----------------------------+
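Following the note above that the parameter sets could come from a base table, one possible variant (my sketch, untested) is to derive args from ServicesAll itself, one parameter set per distinct (account_no, ExtractDate) pair. That reproduces the per-row counts from the question's Python loop in a single round trip; conn is the same connection object as in the question, and the column names follow the sample data above.
# Sketch: one query for all rows, parameter sets taken from the table itself.
single_query = """
WITH args (p0, p1) AS (
    SELECT DISTINCT account_no, ExtractDate FROM ServicesAll
)
SELECT p0, p1, COUNT(DISTINCT service_no) AS active_services
FROM ServicesAll, args
WHERE account_no = args.p0
  AND InstallationDate < args.p1
  AND (BillingStopDate IS NULL OR BillingStopDate >= args.p1)
GROUP BY p0, p1
"""

with conn.cursor() as cursor:
    cursor.execute(single_query)
    results = cursor.fetchall()   # one row per (account_no, ExtractDate) pair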

How to dimensionalize a pandas dataframe

I'm looking for a more elegant way of doing this, other than a for-loop and unpacking manually...
Imagine I have a dataframe that looks like this
| id | value | date | name |
| -- | ----- | ---------- | ---- |
| 1 | 5 | 2021-04-05 | foo |
| 1 | 6 | 2021-04-06 | foo |
| 5 | 7 | 2021-04-05 | bar |
| 5 | 9 | 2021-04-06 | bar |
If I wanted to dimensionalize this, I could split it up into two different tables. One, perhaps, would contain "meta" information about the person, and the other serving as "records" that would all relate back to one person... a pretty simple idea as far as SQL-ian ideas go...
The resulting tables would look like this...
Meta
| id | name |
| -- | ---- |
| 1 | foo |
| 5 | bar |
Records
| id | value | date |
| -- | ----- | ---------- |
| 1 | 5 | 2021-04-05 |
| 1 | 6 | 2021-04-06 |
| 5 | 7 | 2021-04-05 |
| 5 | 9 | 2021-04-06 |
My question is, how can I achieve this "dimensionalizing" of a dataframe with pandas, without having to write a for loop on the unique id key field and unpacking manually?
Think about this not as "splitting" the existing dataframe, but as creating two new dataframes from the original. You can do this in a couple of lines:
meta = df[['id','name']].drop_duplicates() #Select the relevant columns and remove duplicates
records = df.drop("name", axis=1) #Replicate the original dataframe but drop the name column
You can use drop_duplicates based on the subset of columns you want to keep. For the second dataframe, you can drop the name column:
df1 = df.drop_duplicates(['id', 'name']).loc[:,['id', 'name']] # perigon's answer is simpler with df[['id','name']].drop_duplicates()
df2 = df.drop('name', axis=1)
df1, df2
Output:
( id name
0 1 foo
2 5 bar,
id value date
0 1 5 2021-04-05
1 1 6 2021-04-06
2 5 7 2021-04-05
3 5 9 2021-04-06)
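A minimal, self-contained version of the same split, built from the sample data in the question; merging the two frames back together on id restores the original, which is a quick sanity check for this kind of normalization.
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 5, 5],
    'value': [5, 6, 7, 9],
    'date':  ['2021-04-05', '2021-04-06', '2021-04-05', '2021-04-06'],
    'name':  ['foo', 'foo', 'bar', 'bar'],
})

meta = df[['id', 'name']].drop_duplicates()   # one row per person
records = df.drop('name', axis=1)             # id stays as the key back to meta

# Sanity check: joining the pieces back together restores the original table
restored = records.merge(meta, on='id')
assert restored[df.columns].equals(df)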

SQL: select column values which are present for the whole date range

How should I write a SQL query to find all unique values of a column that are present on every date in a date range?
+-------------+--------+------------+
| primary_key | column | date |
+-------------+--------+------------+
| 1 | a | 2020-03-01 |
| 2 | a | 2020-03-02 |
| 3 | a | 2020-03-03 |
| 4 | a | 2020-03-04 |
| 5 | b | 2020-03-01 |
| 6 | b | 2020-03-02 |
| 7 | b | 2020-03-03 |
| 8 | b | 2020-03-04 |
| 9 | c | 2020-03-01 |
| 10 | c | 2020-03-02 |
| 11 | c | 2020-03-03 |
| 12 | d | 2020-03-04 |
+-------------+--------+------------+
In the above example, if the query date range is 2020-03-01 to 2020-03-04, the output should be
a
b
since only a and b are present for that range
Similarly, if the query date range is 2020-03-01 to 2020-03-03, the output should be
a
b
c
I could do this in a Python script by fetching all rows and using a set.
Is it possible to write a SQL query to achieve the same result?
You may aggregate by column value and then assert the distinct date count:
SELECT col
FROM yourTable
WHERE date BETWEEN '2020-03-01' AND '2020-03-04'
GROUP BY col
HAVING COUNT(DISTINCT date) = 4;
One more way to solve the above problem (this answer uses its own table and column names, unique_value with data and present_date):
select data from
    (select data, count(data) as datacnt from unique_value group by data) a,
    (select count(distinct present_date) as cnt from unique_value) b
where a.datacnt = b.cnt;
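The asker mentions that this could be done in a Python script; for reference, a small pandas sketch of the same logic (filter to the range, then keep the values whose distinct date count equals the number of distinct dates in that range), using the column names from the question's table:
import pandas as pd

df = pd.DataFrame({
    'primary_key': range(1, 13),
    'column': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'd'],
    'date': ['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04',
             '2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04',
             '2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04'],
})

start, end = '2020-03-01', '2020-03-04'
in_range = df[df['date'].between(start, end)]

# Keep values that appear on every distinct date within the range
n_dates = in_range['date'].nunique()
per_value = in_range.groupby('column')['date'].nunique()
print(per_value[per_value == n_dates].index.tolist())   # ['a', 'b']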

SQL / Python - how to return count for each attribute and sub-attribute from another table

I have a SELECT that returns a table which has:
- 5 possible values for region (from 1 to 5), and
- 3 possible values for age (1-3), with 2 possible values (1 or 2) for gender within each age group.
So table 1. looks something like this:
+----------+-----------+--------------+---------------+---------+
| att_name | att_value | sub_att_name | sub_att_value | percent |
+----------+-----------+--------------+---------------+---------+
| region | 1 | NULL | 0 | 34 |
| region | 2 | NULL | 0 | 22 |
| region | 3 | NULL | 0 | 15 |
| region | 4 | NULL | 0 | 37 |
| region | 5 | NULL | 0 | 12 |
| age | 1 | gender | 1 | 28 |
| age | 1 | gender | 2 | 8 |
| age | 2 | gender | 1 | 13 |
| age | 2 | gender | 2 | 45 |
| age | 3 | gender | 1 | 34 |
| age | 3 | gender | 2 | 34 |
+----------+-----------+--------------+---------------+---------+
The second table holds records with values from table 1; the unique values of att_name and sub_att_name in table 1 are the columns of table 2:
+--------+-----+-----+
| region | age | gen |
+--------+-----+-----+
| 2 | 2 | 1 |
| 3 | 1 | 2 |
| 3 | 3 | 2 |
| 1 | 3 | 1 |
| 4 | 2 | 2 |
| 5 | 2 | 1 |
+--------+-----+-----+
I want to return the count of each unique value for the region and age/gender attributes from the second table.
The final result should look like this:
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
| att_name | att_value | att_value_count | sub_att_name | sub_att_value | sub_att_value_count | percent |
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
| region | 1 | 1 | NULL | 0 | NULL | 34 |
| region | 2 | 1 | NULL | 0 | NULL | 22 |
| region | 3 | 2 | NULL | 0 | NULL | 15 |
| region | 4 | 1 | NULL | 0 | NULL | 37 |
| region | 5 | 1 | NULL | 0 | NULL | 12 |
| age | 1 | NULL | gender | 1 | 0 | 28 |
| age | 1 | NULL | gender | 2 | 1 | 8 |
| age | 2 | NULL | gender | 1 | 2 | 13 |
| age | 2 | NULL | gender | 2 | 1 | 45 |
| age | 3 | NULL | gender | 1 | 1 | 34 |
| age | 3 | NULL | gender | 2 | 1 | 34 |
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
Explanation
Region - doesn't have a sub-attribute, so sub_att_name and sub_att_value_count are NULL.
att_value_count - counts appearances of each unique region (1 for every region except region 3, which appears 2 times).
Age/gender - counts appearances of each age and gender combination (the groups are 1/1, 1/2, 2/1, 2/2, 3/1, 3/2).
Since we only need to fill in values for the combinations, att_value_count is NULL there.
I'm tagging python and pandas in this question since I don't know if this is possible in SQL at all... I hope it is, since we use analytical tools to pull tables and views from the database, which feels more natural.
EDIT
SQL - the answers look complicated; I'll test them and see if they work tomorrow.
Python - seems more appealing now. Is there a way to parse att_name and sub_att_name, find level-1 and level-2 attributes, and act accordingly? I think this is only possible with Python, and we do have different attributes and attribute levels.
I'm already thankful for the given answers!
I think this is good enough to solve the issue:
import numpy as np
import pandas as pd

data_1 = {'att_name': ['region','region','region','region','region','age','age','age','age','age','age'], 'att_value': [1,2,3,4,5,1,1,2,2,3,3], 'sub_att_name': [np.nan,np.nan,np.nan,np.nan,np.nan,'gender','gender','gender','gender','gender','gender'], 'sub_att_value': [0,0,0,0,0,1,2,1,2,1,2], 'percent': [34,22,15,37,12,28,8,13,45,34,34]}
df_1 = pd.DataFrame(data_1)
data_2 = {'region': [2,3,3,1,4,5], 'age': [2,1,3,3,2,2], 'gen': [1,2,2,1,2,1]}
df_2 = pd.DataFrame(data_2)
df_2_grouped = df_2.groupby(['age','gen'], as_index=False).agg({'region':'count'}).rename(columns={'region':'counts'})
df_final = df_1.merge(df_2_grouped, how='left', left_on=['att_value','sub_att_value'], right_on=['age','gen']).drop(columns=['age','gen']).rename(columns={'counts':'sub_att_value_counts'})
Output of df_final:
att_name att_value sub_att_name sub_att_value percent sub_att_value_counts
0 region 1 NaN 0 34 NaN
1 region 2 NaN 0 22 NaN
2 region 3 NaN 0 15 NaN
3 region 4 NaN 0 37 NaN
4 region 5 NaN 0 12 NaN
5 age 1 gender 1 28 NaN
6 age 1 gender 2 8 1.0
7 age 2 gender 1 13 2.0
8 age 2 gender 2 45 1.0
9 age 3 gender 1 34 1.0
10 age 3 gender 2 34 1.0
This is a pandas solution, basically, lookup or map.
import numpy as np  # df and df2 here are the two tables from the question

df['att_value_count'] = np.nan
s = df['att_name'].eq('region')
df.loc[s, 'att_value_count'] = df.loc[s, 'att_value'].map(df2['region'].value_counts())
# step 2
counts = df2.groupby('age')['gen'].value_counts().unstack('gen', fill_value=0)
df['sub_att_value_count'] = np.nan
tmp = df.loc[~s, ['att_value', 'sub_att_value']]
df.loc[~s, 'sub_att_value_count'] = counts.lookup(tmp['att_value'], tmp['sub_att_value'])
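Note that DataFrame.lookup was deprecated and later removed in newer pandas releases, so the last line may fail on a current install. One possible stand-in (my sketch, not part of the original answer), reusing counts, tmp, and s from above, is to stack counts and reindex it with a MultiIndex built from the two lookup columns:
import pandas as pd

# Replacement for counts.lookup(...) on pandas versions without DataFrame.lookup
idx = pd.MultiIndex.from_arrays([tmp['att_value'], tmp['sub_att_value']])
df.loc[~s, 'sub_att_value_count'] = counts.stack().reindex(idx).to_numpy()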
You can also use merge, which is more SQL-friendly. For example, in step 2:
counts = df2.groupby('age')['gen'].value_counts().reset_index(name='sub_att_value_count')
(df.merge(counts,
          left_on=['att_value', 'sub_att_value'],
          right_on=['age', 'gen'],
          how='outer')
   .drop(['age', 'gen'], axis=1))
Output:
att_name att_value sub_att_name sub_att_value percent att_value_count sub_att_value_count
-- ---------- ----------- -------------- --------------- --------- ----------------- ---------------------
0 region 1 nan 0 34 1 nan
1 region 2 nan 0 22 1 nan
2 region 3 nan 0 15 2 nan
3 region 4 nan 0 37 1 nan
4 region 5 nan 0 12 1 nan
5 age 1 gender 1 28 nan 0
6 age 1 gender 2 8 nan 1
7 age 2 gender 1 13 nan 2
8 age 2 gender 2 45 nan 1
9 age 3 gender 1 34 nan 1
10 age 3 gender 2 34 nan 1
Update: Excuse my SQL skill if this doesn't run (it should though)
select
    b.*,
    c.sub_att_value_count
from
    (select
         df1.*,
         a.att_value_count
     from
         (select
              region, count(*) as att_value_count
          from df2
          group by region
         ) as a
         full outer join df1
             on df1.att_value = a.region
    ) as b
    full outer join
    (select
         age, gender, count(*) as sub_att_value_count
     from df2
     group by age, gender
    ) as c
        on b.att_value = c.age and b.sub_att_value = c.gender
