SQL: select column values which are present for the entire date range - python

How should I write a SQL query to find all unique values of a column that are present on every date in a given date range?
+-------------+--------+------------+
| primary_key | column | date       |
+-------------+--------+------------+
| 1           | a      | 2020-03-01 |
| 2           | a      | 2020-03-02 |
| 3           | a      | 2020-03-03 |
| 4           | a      | 2020-03-04 |
| 5           | b      | 2020-03-01 |
| 6           | b      | 2020-03-02 |
| 7           | b      | 2020-03-03 |
| 8           | b      | 2020-03-04 |
| 9           | c      | 2020-03-01 |
| 10          | c      | 2020-03-02 |
| 11          | c      | 2020-03-03 |
| 12          | d      | 2020-03-04 |
+-------------+--------+------------+
In the above example, if the query date range is 2020-03-01 to 2020-03-04, the output should be
a
b
since only a and b are present on every date in that range.
Similarly, if the query date range is 2020-03-01 to 2020-03-03, the output should be
a
b
c
I could do this in a Python script by fetching all rows and using a set.
Is it possible to write a SQL query to achieve the same result?

You may aggregate by column value and then assert the distinct date count:
SELECT col
FROM yourTable
WHERE date BETWEEN '2020-03-01' AND '2020-03-04'
GROUP BY col
HAVING COUNT(DISTINCT date) = 4;
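If you are driving this from Python anyway, the date range and the expected day count can be parameterized instead of hardcoded. A minimal sketch, assuming a SQLite connection and the yourTable/col/date names from the query above (adapt the placeholder style to your own driver):
import sqlite3

def values_present_every_day(conn, start, end, n_days):
    # Returns the col values that appear on all n_days distinct dates in [start, end].
    query = """
        SELECT col
        FROM yourTable
        WHERE date BETWEEN ? AND ?
        GROUP BY col
        HAVING COUNT(DISTINCT date) = ?
    """
    return [row[0] for row in conn.execute(query, (start, end, n_days))]

conn = sqlite3.connect("example.db")  # hypothetical database file
print(values_present_every_day(conn, "2020-03-01", "2020-03-04", 4))  # expected: ['a', 'b']
print(values_present_every_day(conn, "2020-03-01", "2020-03-03", 3))  # expected: ['a', 'b', 'c']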

Another way to solve the problem is to compare each value's date count with the total number of distinct dates:
select data
from (select data, count(data) as datacnt
      from unique_value
      group by data) a,
     (select count(distinct present_date) as cnt
      from unique_value) b
where a.datacnt = b.cnt;
Note that this variant uses its own table and column names and compares against all distinct dates in unique_value, so the table (or a view of it) must already be restricted to the date range of interest.

Related

How to reindex a datetime-based multiindex in pandas

I have a dataframe that counts the number of times an event has occured per user per day. Users may have 0 events per day and (since the table is an aggregate from a raw event log) rows with 0 events are missing from the dataframe. I would like to add these missing rows and group the data by week so that each user has one entry per week (including 0 if applicable).
Here is an example of my input:
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({
    "person_id": np.arange(3).repeat(5),
    "date": pd.date_range("2022-01-01", "2022-01-15", freq="d"),
    "event_count": np.random.randint(1, 7, 15),
})

# end of each week
# Note: week 2022-01-23 is not in df, but should be part of the result
desired_index = pd.to_datetime(["2022-01-02", "2022-01-09", "2022-01-16", "2022-01-23"])
df
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-01 00:00:00 | 4 |
| 1 | 0 | 2022-01-02 00:00:00 | 5 |
| 2 | 0 | 2022-01-03 00:00:00 | 3 |
| 3 | 0 | 2022-01-04 00:00:00 | 5 |
| 4 | 0 | 2022-01-05 00:00:00 | 5 |
| 5 | 1 | 2022-01-06 00:00:00 | 2 |
| 6 | 1 | 2022-01-07 00:00:00 | 3 |
| 7 | 1 | 2022-01-08 00:00:00 | 3 |
| 8 | 1 | 2022-01-09 00:00:00 | 3 |
| 9 | 1 | 2022-01-10 00:00:00 | 5 |
| 10 | 2 | 2022-01-11 00:00:00 | 4 |
| 11 | 2 | 2022-01-12 00:00:00 | 3 |
| 12 | 2 | 2022-01-13 00:00:00 | 6 |
| 13 | 2 | 2022-01-14 00:00:00 | 5 |
| 14 | 2 | 2022-01-15 00:00:00 | 2 |
This is what my desired result looks like:
| | person_id | level_1 | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 0 | 2022-01-16 00:00:00 | 0 |
| 3 | 0 | 2022-01-23 00:00:00 | 0 |
| 4 | 1 | 2022-01-02 00:00:00 | 0 |
| 5 | 1 | 2022-01-09 00:00:00 | 11 |
| 6 | 1 | 2022-01-16 00:00:00 | 5 |
| 7 | 1 | 2022-01-23 00:00:00 | 0 |
| 8 | 2 | 2022-01-02 00:00:00 | 0 |
| 9 | 2 | 2022-01-09 00:00:00 | 0 |
| 10 | 2 | 2022-01-16 00:00:00 | 20 |
| 11 | 2 | 2022-01-23 00:00:00 | 0 |
I can produce it using:
(
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .groupby("person_id").apply(
        lambda df: (
            df
            .reset_index(drop=True, level=0)
            .reindex(desired_index, fill_value=0))
    )
    .reset_index()
)
However, according to the docs of reindex, I should be able to pass level=1 as a kwarg directly, without having to do another groupby. When I do this, though, I get an "inner join" of the two indices instead of an "outer join":
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(desired_index, level=1)
    .reset_index()
)
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 1 | 2022-01-09 00:00:00 | 11 |
| 3 | 1 | 2022-01-16 00:00:00 | 5 |
| 4 | 2 | 2022-01-16 00:00:00 | 20 |
Why is that, and how am I supposed to use df.reindex correctly?
I have found a similar SO question on reindexing a multi-index level, but the accepted answer there uses df.unstack, which doesn't work for me, because not every level of my desired index occurs in my current index (and vice versa).
You need to reindex by both levels of the MultiIndex:
mux = pd.MultiIndex.from_product([df['person_id'].unique(), desired_index],
                                 names=['person_id', 'date'])

result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(mux, fill_value=0)
    .reset_index()
)
print (result)
    person_id       date  event_count
0           0 2022-01-02            9
1           0 2022-01-09           13
2           0 2022-01-16            0
3           0 2022-01-23            0
4           1 2022-01-02            0
5           1 2022-01-09           11
6           1 2022-01-16            5
7           1 2022-01-23            0
8           2 2022-01-02            0
9           2 2022-01-09            0
10          2 2022-01-16           20
11          2 2022-01-23            0
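For reference, pd.MultiIndex.from_product builds the full Cartesian product of users and target weeks, which is what guarantees one row per combination. A quick sanity check using the same df, desired_index and mux as above:
# 3 unique person_ids x 4 target week-ends = 12 rows in the result
assert len(mux) == df["person_id"].nunique() * len(desired_index) == 12
assert len(result) == 12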

Take the sum of every N rows per group in a pandas DataFrame

Just want to mention that this question is not a duplicate of this: Take the sum of every N rows in pandas series
My problem is a bit different, as I want to aggregate every N rows per group. My current code looks like this:
import pandas as pd

df = pd.DataFrame({'ID': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                   'DATE': ['2021-01-01', '2021-01-03', '2021-01-08',
                            '2021-03-04', '2021-03-06', '2021-03-08'],
                   'VALUE': [10, 15, 25, 40, 60, 90]})
df['DATE'] = pd.to_datetime(df['DATE'])
df = df.sort_values(by=['ID', 'DATE'])
df.head(10)
Sample DataFrame:
+----+------------+-------+
| ID | DATE       | VALUE |
+----+------------+-------+
| AA | 2021-01-01 |    10 |
+----+------------+-------+
| AA | 2021-01-03 |    15 |
+----+------------+-------+
| AA | 2021-01-08 |    25 |
+----+------------+-------+
| BB | 2021-03-04 |    40 |
+----+------------+-------+
| BB | 2021-03-06 |    60 |
+----+------------+-------+
| BB | 2021-03-08 |    90 |
+----+------------+-------+
Based on that post, I apply this preprocessing:
#Calculate result
df.groupby(['ID', df.index//2]).agg({'VALUE':'mean', 'DATE':'median'}).reset_index()
I get this:
+----+-------+------------+
| ID | VALUE | DATE       |
+----+-------+------------+
| AA | 12.5  | 2021-01-02 |
+----+-------+------------+
| AA | 25    | 2021-01-08 |
+----+-------+------------+
| BB | 40    | 2021-03-04 |
+----+-------+------------+
| BB | 75    | 2021-03-07 |
+----+-------+------------+
But I want this:
+----+-------+------------+
| ID | VALUE | DATE       |
+----+-------+------------+
| AA | 12.5  | 2021-01-02 |
+----+-------+------------+
| AA | 25    | 2021-01-08 |
+----+-------+------------+
| BB | 50    | 2021-03-05 |
+----+-------+------------+
| BB | 90    | 2021-03-08 |
+----+-------+------------+
It seems that grouping on the DataFrame index does not work well when my groups are not aligned to multiples of N: it messes up the beginning of the next group and how the aggregations happen. Any suggestions? My dates can be completely irregular, by the way.
You can use groupby.cumcount to form a subgroup:
N = 2
group = df.groupby('ID').cumcount() // N

out = (df.groupby(['ID', group])
         .agg({'VALUE': 'mean', 'DATE': 'median'})
         .droplevel(1).reset_index()
       )
output:
   ID  VALUE       DATE
0  AA   12.5 2021-01-02
1  AA   25.0 2021-01-08
2  BB   50.0 2021-03-05
3  BB   90.0 2021-03-08
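To see why this avoids the misalignment, it helps to look at the subgroup counter itself: cumcount restarts at 0 within each ID, so a block of N rows never straddles two IDs. Using the same df as above:
print(df.assign(SUBGROUP=df.groupby('ID').cumcount() // 2))
#    ID       DATE  VALUE  SUBGROUP
# 0  AA 2021-01-01     10         0
# 1  AA 2021-01-03     15         0
# 2  AA 2021-01-08     25         1
# 3  BB 2021-03-04     40         0
# 4  BB 2021-03-06     60         0
# 5  BB 2021-03-08     90         1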

How to write a single query to run some retrospective aggregation when the time window is different for every row?

I am writing some SQL queries to create a dataset for customer churn prediction based on historical service data. Some of the services date back years. A small percentage of them churned at some point in the past, while others ended up getting renewed. Some of the attributes are based on aggregations over the services that were active at the same time as each service. For example, I want to find out how many services in total were active under the same account while an individual service was active.
| account_no | service_no | InstallationDate | BillingStopDate | ExtractDate | Churn |
|------------|------------|------------------|-----------------|-------------|-------|
| A1 | S1 | 2018-01 | 2019-03 | 2019-02 | 1 |
| A2 | S2 | 2016-05 | 2020-04 | 2020-03 | 0 |
| A1 | S3 | 2018-07 | 2019-07 | 2019-06 | 0 |
| A1 | S4 | 2016-11 | 2021-03 | 2021-02 | 0 |
| A1 | S5 | 2018-01 | 2019-01 | 2018-12 | 1 |
| A2 | S6 | 2019-01 | 2021-08 | 2021-07 | 1 |
The totals of active services for S1 and S3, both under account A1, are 4 and 2 respectively because they have different ExtractDate values. I am currently running a SQL query through a Python wrapper, with account_no and ExtractDate as input arguments, and then looping through the whole dataset and running the query for each row:
from tqdm import tqdm

def stats_per_account(conn, account, till_date):
    query = """select count(distinct ServiceNo) from ServicesAll
               where AccountNo = '{0}' and
               InstallationDate < '{1}' and
               (BillingStopDate is NULL or BillingStopDate >= '{2}')""".format(account, till_date, till_date)
    with conn.cursor() as cursor:
        cursor.execute(query)
        row = cursor.fetchone()
    return row

sno = []
ano = []
svc = []
for _, row in tqdm(iter_df.iterrows(), total=iter_df.shape[0]):
    cnt = stats_per_account(conn, row['AccountNo'], row['ExtractDate'])
    sno.append(row['ServiceNo'])
    ano.append(row['AccountNo'])
    svc.append(cnt[0])
Since the queries are run sequentially, one for each row in the dataset, this is very time consuming. I wonder if there is a more efficient way of doing this, maybe a single query for all the rows altogether?
Here's a rough example you might be able to follow. I use a WITH clause to provide the set of parameters to apply, but you could also store this in a base table to use in the JOIN.
If you have an algorithmic way to generate the sets of parameters, that could be used in a derived table or CTE term to provide as many sets as you wish. We could even use recursive behavior, if your database supports it.
A list of date ranges and list of accounts could easily be used to generate sets of parameters.
The data:
SELECT * FROM ServicesAll;
+------------+------------+------------------+-----------------+-------------+-------+
| account_no | service_no | InstallationDate | BillingStopDate | ExtractDate | Churn |
+------------+------------+------------------+-----------------+-------------+-------+
| A1         | S1         | 2018-01-01       | 2019-03-01      | 2019-02-01  |     1 |
| A2         | S2         | 2016-05-01       | 2020-04-01      | 2020-03-01  |     0 |
| A1         | S3         | 2018-07-01       | 2019-07-01      | 2019-06-01  |     0 |
| A1         | S4         | 2016-11-01       | 2021-03-01      | 2021-02-01  |     0 |
| A1         | S5         | 2018-01-01       | 2019-01-01      | 2018-12-01  |     1 |
| A2         | S6         | 2019-01-01       | 2021-08-01      | 2021-07-01  |     1 |
+------------+------------+------------------+-----------------+-------------+-------+
Here's an approach which generates lists of accounts and date ranges to derive the sets of parameters to apply, and then applies them all in the same query:
WITH RECURSIVE accounts (account_no) AS (SELECT DISTINCT account_no FROM ServicesAll)
   , startDate (start_date) AS (SELECT current_date - INTERVAL '5' YEAR)
   , ranges (start_date, end_date, n) AS (
         SELECT start_date, start_date + INTERVAL '6' MONTH, 1 FROM startDate
          UNION ALL
         SELECT start_date + INTERVAL '6' MONTH
              , end_date   + INTERVAL '6' MONTH, n+1
           FROM ranges WHERE n < 8
     )
   , args (p0, p1, p2) AS (
         SELECT account_no, start_date, end_date
           FROM accounts, ranges
     )
SELECT p0, p1, p2, COUNT(DISTINCT service_no)
  FROM ServicesAll, args
 WHERE account_no = args.p0
   AND InstallationDate < args.p1
   AND (BillingStopDate IS NULL OR BillingStopDate >= args.p2)
 GROUP BY p0, p1, p2
;
The result:
+------+------------+------------+----------------------------+
| p0   | p1         | p2         | COUNT(DISTINCT service_no) |
+------+------------+------------+----------------------------+
| A1   | 2017-02-17 | 2017-08-17 |                          1 |
| A1   | 2017-08-17 | 2018-02-17 |                          1 |
| A1   | 2018-02-17 | 2018-08-17 |                          3 |
| A1   | 2018-08-17 | 2019-02-17 |                          3 |
| A1   | 2019-02-17 | 2019-08-17 |                          1 |
| A1   | 2019-08-17 | 2020-02-17 |                          1 |
| A1   | 2020-02-17 | 2020-08-17 |                          1 |
| A2   | 2016-08-17 | 2017-02-17 |                          1 |
| A2   | 2017-02-17 | 2017-08-17 |                          1 |
| A2   | 2017-08-17 | 2018-02-17 |                          1 |
| A2   | 2018-02-17 | 2018-08-17 |                          1 |
| A2   | 2018-08-17 | 2019-02-17 |                          1 |
| A2   | 2019-02-17 | 2019-08-17 |                          2 |
| A2   | 2019-08-17 | 2020-02-17 |                          2 |
| A2   | 2020-02-17 | 2020-08-17 |                          1 |
+------+------------+------------+----------------------------+
The following SQL provides sets of parameters in the args CTE term and then joins with that CTE term in the final query expression. The key is the GROUP BY clause, which produces a result row for each set of parameters.
WITH args (p0, p1, p2) AS (
      SELECT 'A1', '2018-07-02', '2018-07-02' UNION
      SELECT 'A1', '2016-11-02', '2016-11-02' UNION
      SELECT 'A2', '2016-11-02', '2016-11-02'
)
SELECT p0, p1, p2, COUNT(DISTINCT service_no)
  FROM ServicesAll, args
 WHERE account_no = args.p0
   AND InstallationDate < args.p1
   AND (BillingStopDate IS NULL OR BillingStopDate >= args.p2)
 GROUP BY p0, p1, p2
;
Result:
+----+------------+------------+----------------------------+
| p0 | p1         | p2         | COUNT(DISTINCT service_no) |
+----+------------+------------+----------------------------+
| A1 | 2016-11-02 | 2016-11-02 |                          1 |
| A1 | 2018-07-02 | 2018-07-02 |                          4 |
| A2 | 2016-11-02 | 2016-11-02 |                          1 |
+----+------------+------------+----------------------------+
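A rough sketch of how the args rows could be generated from the dataframe on the Python side, so the grouped query runs once instead of once per row. This assumes the iter_df and conn objects from the question, the column names from the question's stats_per_account query, and a database that accepts a column list on the CTE as in the queries above; it also formats values straight into the SQL, so with a real driver you would prefer parameter binding or a temporary table:
def build_args_cte(pairs):
    # pairs: iterable of (account_no, extract_date) tuples
    selects = ["SELECT '{0}', '{1}', '{1}'".format(acct, dt) for acct, dt in pairs]
    return "WITH args (p0, p1, p2) AS (\n    " + "\n    UNION\n    ".join(selects) + "\n)"

pairs = (iter_df[['AccountNo', 'ExtractDate']]
         .drop_duplicates()
         .itertuples(index=False, name=None))

query = build_args_cte(pairs) + """
SELECT p0, p1, p2, COUNT(DISTINCT ServiceNo) AS active_services
FROM ServicesAll, args
WHERE AccountNo = args.p0
  AND InstallationDate < args.p1
  AND (BillingStopDate IS NULL OR BillingStopDate >= args.p2)
GROUP BY p0, p1, p2
"""

with conn.cursor() as cursor:
    cursor.execute(query)
    counts = cursor.fetchall()   # one row per (account, extract_date) pair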

Reshape long form panel data to wide stacked time series

I have panel data of the form:
+--------+----------+------------+----------+
|        | user_id  | order_date | values   |
+--------+----------+------------+----------+
| 0      | 11039591 | 2017-01-01 | 3277.466 |
| 1      | 25717549 | 2017-01-01 | 587.553  |
| 2      | 13629086 | 2017-01-01 | 501.882  |
| 3      | 3022981  | 2017-01-01 | 1352.546 |
| 4      | 6084613  | 2017-01-01 | 441.151  |
| ...    | ...      | ...        | ...      |
| 186415 | 17955698 | 2020-05-01 | 146.868  |
| 186416 | 17384133 | 2020-05-01 | 191.461  |
| 186417 | 28593228 | 2020-05-01 | 207.201  |
| 186418 | 29065953 | 2020-05-01 | 430.401  |
| 186419 | 4470378  | 2020-05-01 | 87.086   |
+--------+----------+------------+----------+
as a Pandas DataFrame in Python.
The data is basically stacked time series data; the table contains numerous time series corresponding to observations for unique users within a certain period (2017/01 - 2020/05 above). The level of coverage for the period is likely to be very low amongst individual users, meaning that if you isolate the individual time series they're all of varying lengths.
I want to take this long-format panel data and convert it to wide format, such that each column is a day and each row corresponds to a unique user:
+----------+------------+------------+------------+------------+------------+
|          | 2017-01-01 | 2017-01-02 | 2017-01-03 | 2017-01-04 | 2017-01-05 |
+----------+------------+------------+------------+------------+------------+
| 11039591 | 3277.466   | 6482.722   | NaN        | NaN        | NaN        |
| 25717549 | 587.553    | NaN        | NaN        | NaN        | NaN        |
| 13629086 | 501.882    | NaN        | NaN        | NaN        | NaN        |
| 3022981  | 1352.546   | NaN        | NaN        | 557.728    | NaN        |
| 6084613  | 441.151    | NaN        | NaN        | NaN        | NaN        |
+----------+------------+------------+------------+------------+------------+
I'm struggling to get this using unstack/pivot or other Pandas built-ins as I keep running into:
ValueError: Index contains duplicate entries, cannot reshape
due to the repeated user IDs.
My current solution uses a loop to pull out each individual time series and then concatenates them together, so it's not scalable; it's already really slow with just 180k rows:
def time_series_stacker(df):
    ts = list()
    for user in df['user_id'].unique():
        values = df.loc[df['user_id'] == user].drop('user_id', axis=1).T.values
        instance = pd.DataFrame(
            values[1, :].reshape(1, -1),
            index=[user],
            columns=values[0, :].astype('datetime64[ns]')
        )
        ts.append(instance)
    return pd.concat(ts, axis=0)
Can anyone help out with reshaping this more efficiently please?
This is a perfect time to try out pivot_table
    user_id order_date    values
0  11039591 2017-01-01  3277.466
1  11039591 2017-01-02   587.553
2  13629086 2017-01-03   501.882
3  13629086 2017-01-02  1352.546
4   6084613 2017-01-01   441.151
df.pivot_table(index='user_id',columns='order_date',values='values')
Output
order_date  2017-01-01  2017-01-02  2017-01-03
user_id
6084613        441.151         NaN         NaN
11039591      3277.466     587.553         NaN
13629086           NaN    1352.546     501.882
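One detail worth noting, and the reason the plain pivot raised the duplicate-entries error: pivot_table aggregates duplicate (user_id, order_date) pairs, taking the mean by default. If duplicates should be summed instead, pass an explicit aggfunc:
df.pivot_table(index='user_id', columns='order_date', values='values', aggfunc='sum')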

SQL/Impala: Applying another group by on the first group by output

I need to do a group by on top of a group by output. For example, in the below table1:
id | timestamp | team
----------------------------
1 | 2016-01-02 | A
2 | 2016-02-01 | B
1 | 2016-02-04 | A
1 | 2016-03-05 | A
3 | 2016-05-12 | B
3 | 2016-05-15 | B
4 | 2016-07-07 | A
5 | 2016-08-01 | C
6 | 2015-08-01 | C
1 | 2015-04-01 | A
If I do a query:
query = 'select id, max(timestamp) as latest_ts from table1' + \
        ' where timestamp > "2016-01-01 00:00:00" group by id'
I would have:
id | latest_ts |
---------------------
2 | 2016-02-01 |
1 | 2016-03-05 |
3 | 2016-05-15 |
4 | 2016-07-07 |
5 | 2016-08-01 |
However, I am wondering if it is possible to include the team column like below as well?
id | latest_ts | team
----------------------------
2 | 2016-02-01 | B
1 | 2016-03-05 | A
3 | 2016-05-15 | B
4 | 2016-07-07 | A
5 | 2016-08-01 | C
Ultimately, what I really need to know is how many distinct ids there are in each team for the year 2016. My expected result should be:
team | count(id)
-------------------
A | 2
B | 2
C | 1
I am trying to do another group by on top of the first group by result using the code below, but I get syntax errors.
import pandas as pd

query = 'select team, count(id) from ' + \
        '(select id, max(timestamp) as latest_ts from table1' + \
        ' where timestamp > "2016-01-01 00:00:00" group by id)' + \
        'group by team'
cursor = impala_con.cursor()
cursor.execute('USE history')
cursor.execute(query)
df_result = as_pandas(cursor)
df_result
So I am wondering if this is something that can be achieved? If so, what would be the right way to do it? Thanks!
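For what it's worth, two issues stand out in the concatenated string: the derived table has no alias (Impala requires one), and the inner query never selects team, so the outer group by team has nothing to group on. A minimal sketch of a simpler query that should produce the expected counts directly, assuming the same impala_con connection and history database as in the question:
from impala.util import as_pandas

query = ('select team, count(distinct id) as id_count '
         'from table1 '
         'where timestamp > "2016-01-01 00:00:00" '
         'group by team')

cursor = impala_con.cursor()
cursor.execute('USE history')
cursor.execute(query)
df_result = as_pandas(cursor)   # expected: A -> 2, B -> 2, C -> 1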
