Groupby to compare, identify, and make notes on max date - python

I am working with the following table:
+------+------+------+------+---------------+---------+-------+
| ID 1 | ID 2 | Date | Type | Marked_Latest | Updated | Notes |
+------+------+------+------+---------------+---------+-------+
|    1 |  100 | 2001 | SMT  |               |         |       |
|    1 |  101 | 2005 | SMT  |               |         |       |
|    1 |  102 | 2020 | SMT  | Latest        |         |       |
|    1 |  103 | 2020 | SMT  |               |         |       |
|    1 |  103 | 2020 | ABT  |               |         |       |
|    2 |  201 | 2009 | CMT  | Latest        |         |       |
|    2 |  202 | 2022 | SMT  |               |         |       |
|    2 |  203 | 2022 | SMT  |               |         |       |
+------+------+------+------+---------------+---------+-------+
I am trying to perform the following steps using df.query(), but with so many caveats I am not sure how to fit them all in.
Step 1: Looking only at Type == "SMT" or Type == "CMT", group by ID 1 and identify the latest Date, then compare this (per ID 1) to the Date of the row where Marked_Latest == "Latest" (essentially just verifying that the marked date is correct).
Step 2: If the date values are the same, do nothing. If they differ, write the correct ID 2 into Updated next to the original Marked_Latest == "Latest" row.
Step 3: If multiple rows share the same max Date, put "multiple" in Notes for each of them.
This will result in the following table:
+------+------+------+------+---------------+---------+----------+
| ID 1 | ID 2 | Date | Type | Marked_Latest | Updated | Notes    |
+------+------+------+------+---------------+---------+----------+
|    1 |  100 | 2001 | SMT  |               |         |          |
|    1 |  101 | 2005 | SMT  |               |         |          |
|    1 |  102 | 2020 | SMT  | Latest        |         | multiple |
|    1 |  103 | 2020 | SMT  |               |         | multiple |
|    1 |  103 | 2020 | ABT  |               |         |          |
|    2 |  201 | 2009 | CMT  | Latest        | 203     |          |
|    2 |  202 | 2022 | SMT  |               |         | multiple |
|    2 |  203 | 2022 | SMT  |               |         | multiple |
+------+------+------+------+---------------+---------+----------+
To summarize: check that the latest date is actually the one marked as latest. If it is not, write the updated ID 2 next to the original (incorrect) "Latest" row. And when multiple rows share the latest date, put "multiple" in Notes for each of them.
I have gotten only as far as identifying the actual latest date, using
q = df.query('Type == "SMT" or Type == "CMT"').groupby('ID 1').last()
q
This will return a subset with the latest dates marked, but I am not sure how to proceed from here, i.e. how to now compare this dataframe with the date field corresponding to Marked_Latest.
All help appreciated.

Use:
import numpy as np

# take ID from 'ID 1' only for rows matching the Type condition
df['ID'] = df['ID 1'].where(df['Type'].isin(['SMT', 'CMT']))
# get the last Date and ID 2 per `ID` into the columns Notes, Updated
df[['Notes', 'Updated']] = df.groupby('ID')[['Date', 'ID 2']].transform('last')
# compare the latest date in Notes with the original Date
m1 = df['Notes'].ne(df['Date'])
# keep Updated only where the dates differ and the row is marked Latest, else empty string
df['Updated'] = df['Updated'].where(m1 & df['Marked_Latest'].eq('Latest'), '')
# if the row is on the latest date and that date is duplicated within 'ID 1', set 'multiple'
df['Notes'] = np.where(df.duplicated(['ID 1', 'Date'], keep=False) & ~m1, 'multiple', '')
df = df.drop('ID', axis=1)
print(df)
   ID 1  ID 2  Date Type Marked_Latest Updated     Notes
0     1   100  2001  SMT           NaN
1     1   101  2005  SMT           NaN
2     1   102  2020  SMT        Latest          multiple
3     1   103  2020  SMT           NaN          multiple
4     1   103  2020  ABT           NaN
5     2   201  2009  CMT        Latest   203.0
6     2   202  2022  SMT           NaN          multiple
7     2   203  2022  SMT           NaN          multiple
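One caveat worth noting: transform('last') takes whatever row happens to come last in each group, so it only equals the latest date if the rows are already ordered by Date inside each ID 1 group (as they are in the sample). If that ordering is not guaranteed, sorting first keeps the rest of the logic unchanged; a minimal sketch, assuming df and the helper ID column from above already exist:
# sort so that 'last' per group really is the latest date, then reuse the same transform
df = df.sort_values(['ID 1', 'Date'])
df[['Notes', 'Updated']] = df.groupby('ID')[['Date', 'ID 2']].transform('last')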

Try:
import pandas as pd

cols = ['ID 1', 'ID 2', 'Date', 'Type', 'Marked_Latest', 'Updated', 'Notes']
data = [[1, 100, 2001, 'SMT', '', '', ''],
        [1, 101, 2005, 'SMT', '', '', ''],
        [1, 102, 2020, 'SMT', 'Latest', '', ''],
        [1, 103, 2020, 'SMT', '', '', ''],
        [1, 103, 2020, 'ABT', '', '', '']]
df = pd.DataFrame(data, columns=cols)
temp = df[(df['Type'] == "SMT") | (df['Type'] == "CMT")]
new = temp.groupby('ID 1')['ID 2'].last().values[0]
latest = temp[temp['Marked_Latest'] == 'Latest']
nind = temp[temp['ID 2'] == new].index
if new != latest['ID 2'].values[0]:
    df.loc[latest.index, 'Updated'] = new
    df.loc[latest.index, 'Notes'] = 'multiple'
    df.loc[nind, 'Notes'] = 'multiple'
Output:

Related

Pandas Dataframe keep rows where values of 2 columns are in a list of couples

I have a list of couples:
year_month = [(2020,8), (2021,1), (2021,6)]
and a dataframe df
| ID | Year | Month |
| 1 | 2020 | 1 |
| ... |
| 1 | 2020 | 12 |
| 1 | 2021 | 1 |
| ... |
| 1 | 2021 | 12 |
| 2 | 2020 | 1 |
| ... |
| 2 | 2020 | 12 |
| 2 | 2021 | 1 |
| ... |
| 2 | 2021 | 12 |
| 3 | 2021 | 1 |
| ... |
I want to select rows where Year and Month correspond to one of the couples in the year_month list:
Output df:
| ID | Year | Month |
| 1 | 2020 | 8 |
| 1 | 2021 | 1 |
| 1 | 2021 | 6 |
| 2 | 2020 | 8 |
| 2 | 2021 | 1 |
| 2 | 2021 | 6 |
| 3 | 2020 | 8 |
| ... |
Any idea on how to automate it, so I only have to change the year_month couples?
I want to put many couples in year_month, so I want to keep a list of couples rather than spell out every combination in the filter.
I don't want to do this:
df = df[((df['Year'] == 2020) & (df['Month'] == 8)) |
        ((df['Year'] == 2021) & (df['Month'] == 1)) |
        ((df['Year'] == 2021) & (df['Month'] == 6))]
You can use a list comprehension and filter your dataframe with your list of tuples as below:
year_month = [(2020,8), (2021,1), (2021,6)]
df[[i in year_month for i in zip(df.Year,df.Month)]]
Which gives only the paired values back:
ID Year Month
2 1 2021 1
6 2 2021 1
8 3 2021 1
One way using pandas.DataFrame.merge:
df.merge(pd.DataFrame(year_month, columns=["Year", "Month"]))
Output:
ID Year Month
0 1 2021 1
1 2 2021 1
2 3 2021 1
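Another option, if you prefer an index-based membership test: MultiIndex.isin accepts an iterable of tuples, so the two columns can be checked against year_month directly. A small self-contained sketch (toy data, same idea):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2],
                   'Year': [2020, 2020, 2021, 2021],
                   'Month': [1, 8, 1, 6]})
year_month = [(2020, 8), (2021, 1), (2021, 6)]

# build a boolean mask from the (Year, Month) pairs and keep matching rows
mask = df.set_index(['Year', 'Month']).index.isin(year_month)
print(df[mask])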

Filtering out rows that don't meet specific order and value criteria in Python or tSQL?

I need some help filtering rows out of a customer dataset I've created.
The dataset contains customer IDs, policy numbers, and the dates related to their policies. Customers can switch freely between policies, anytime they wish. The following dataset is all just an example dataset I put together. I can use either pandas or sql server to filter out the right customers.
Objective:
I want to filter the dataset and retrieve customers under the following conditions:
Customer must have chronologically been on Policy rate 13, then switched to 11.
Customers must have at least 350 days on both policies.
I've included a column (policy_order) showing the order of active policies. It doesn't matter when the 13 => 11 switch occurred, as long as the jump was from 13 to 11, and they spent 350 days on each.
| row | cust_id | policy_num | policy_start | policy_end | policy_order | days_on_policy |
|-----|---------|------------|--------------|------------|--------------|----------------|
| 1 | 1000 | 17 | 09/23/2013 | 11/05/2013 | 1 | 43 |
| 2 | 1200 | 13 | 08/26/2011 | 04/30/2019 | 1 | 2804 |
| 3 | 3400 | 13 | 08/31/2012 | 02/22/2015 | 1 | 905 |
| 4 | 5000 | 17 | 04/12/2014 | 07/28/2014 | 1 | 107 |
| 5 | 5000 | 13 | 07/28/2014 | 08/24/2016 | 2 | 758 |
| 6 | 5000 | 11 | 08/24/2016 | 10/20/2018 | 3 | 787 |
| 7 | 5000 | 13 | 10/20/2018 | 05/02/2019 | 4 | 194 |
| 8 | 7600 | 13 | 02/02/2015 | 05/03/2019 | 1 | 1551 |
| 9 | 4300 | 11 | 01/07/2015 | 05/04/2017 | 1 | 848 |
| 10 | 4300 | 13 | 05/04/2017 | 05/05/2019 | 2 | 731 |
| 11 | 9800 | 13 | 12/12/2001 | 10/06/2015 | 1 | 5046 |
| 12 | 9800 | 11 | 10/06/2015 | 05/06/2019 | 2 | 1308 |
As seen in the table above, two customers match the criteria. Customer 5000, and customer 9800. I used customer 5000 as an example, because they've switched policies multiple times but still meet the criteria in rows 5 and 6. These are the only rows I'm concerned with.
So the output that I would want to see would look like this:
| row | acct | policy_num | policy_start | policy_end | policy_order | days_on_policy |
|-----|------|------------|--------------|------------|--------------|----------------|
| 1 | 5000 | 13 | 7/28/2014 | 8/24/2016 | 2 | 758 |
| 2 | 5000 | 11 | 8/24/2016 | 10/20/2018 | 3 | 787 |
| 3 | 9800 | 13 | 12/12/2001 | 10/6/2015 | 1 | 5046 |
| 4 | 9800 | 11 | 10/6/2015 | 5/6/2019 | 2 | 1308 |
The results would show the customer ID, the correct policy numbers, relevant dates, and how many days they were on each policy.
I've tried filtering using the WHERE clause in SQL (which I'm admittedly bad at), but haven't even come close to an answer - and don't even really know where to start.
My main goal is to try and get the rows filtered using order, policy number, and days on policy.
Any and all help is greatly appreciated!
If you want a solution based on Pandas, then define the following
filtering function:
def fltr(gr):
    wrk = gr.query('policy_num in [11, 13]').sort_values(['policy_order'])
    pNum = wrk.set_index('policy_order').policy_num
    if not ((pNum == 11).any() and (pNum == 13).any()):
        return None
    ind11 = pNum[pNum == 11].index[0]
    ind13 = pNum[pNum == 13].index[0]
    if ind13 > ind11:
        return None
    if (wrk.groupby('policy_num').days_on_policy.sum() >= 350).all():
        return wrk.drop_duplicates(subset='policy_num')
    return None
Then use it in groupby:
df.groupby('cust_id').apply(fltr)
A short description of the filtering function
It starts with computing auxiliary variables:
wrk - rows of the current group for policy_num == either 11
or 13, ordered by policy_order.
pNum - policy_num column from wrk, indexed by policy_order.
The filtering function has two "initial" conditions under which it returns None,
rejecting the current group:
pNum fails to contain at least one 11 and at least one 13.
The index (actually policy_order) of the first 13 in pNum is greater than
the index of the first 11 (i.e. policy 13 follows policy 11).
The last decision is based on a question: Does each of the policies
in question (11 and 13) have the sum of days_on_policy >= 350?
If yes, the function returns rows from wrk without repetitions,
to drop possible last 13 (as in the case of group 5000).
Otherwise, the current group is also rejected.
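For completeness, a minimal usage sketch: groups for which fltr returns None are simply dropped from the result, and reset_index just flattens the (cust_id, row) MultiIndex that groupby-apply produces.
result = df.groupby('cust_id').apply(fltr).reset_index(drop=True)
print(result)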
Here is what I guess you would need.
SELECT *
FROM policy p1
WHERE policy_num = 13
AND days_on_policy >= 350
AND EXISTS
(SELECT 1 FROM policy p2
WHERE p1.cust_id = p2.cust_id
AND p2.policy_num =11
AND p2.policy_start >= p1.policy_end
AND p2.days_on_policy >= 350)
UNION ALL
SELECT *
FROM policy p1
where policy_num = 11
AND days_on_policy >= 350
AND EXISTS
(SELECT 1 FROM policy p2
WHERE p1.cust_id = p2.cust_id
AND p2.policy_num =13
AND p1.policy_start >= p2.policy_end
AND p2.days_on_policy >= 350)
SQL Fiddle
It is nearly always better to do the filtering of data within the query, unless performance of the database is affected by the query.
If your dataset isn't too large, this is the procedure I would use to filter.
# filter on the criteria for the policy number
df_13_fltr = df[(df['policy_num'] == 13) &
                (df['days_on_policy'] >= 350)][['row', 'cust_id', 'policy_end']]
df_11_fltr = df[(df['policy_num'] == 11) &
                (df['days_on_policy'] >= 350)][['row', 'cust_id', 'policy_start']]
# merge the two filtered DataFrames and compare policy_end against policy_start
df_fltr = df_11_fltr.merge(df_13_fltr, on='cust_id', how='inner', suffixes=('11', '13'))
df_fltr = df_fltr[df_fltr['policy_end'] <= df_fltr['policy_start']][['row13', 'row11']]
# put the rows in a list
rows = list(df_fltr['row13'].values) + list(df_fltr['row11'])
# use the rows list in a lambda filter on the original dataset
df[df['row'].apply(lambda x: x in rows)]
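An equivalent, vectorised membership test (just a stylistic alternative to the lambda):
df[df['row'].isin(rows)]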
With a self join and the conditions applied on the ON clause:
select t1.*
from tablename t1 inner join tablename t2
on
t2.cust_id = t1.cust_id
and (
(t2.policy_start = t1.policy_end) and (t1.policy_num = 13 and t2.policy_num = 11)
or
(t1.policy_start = t2.policy_end) and (t2.policy_num = 13 and t1.policy_num = 11)
)
and t1.days_on_policy >= 350 and t2.days_on_policy >= 350
order by t1.cust_id, t1.policy_start
See the demo.
Results:
> row | cust_id | policy_num | policy_start | policy_end | policy_order | days_on_policy
> --: | ------: | ---------: | :------------------ | :------------------ | -----------: | -------------:
> 5 | 5000 | 13 | 28/07/2014 00:00:00 | 24/08/2016 00:00:00 | 2 | 758
> 6 | 5000 | 11 | 24/08/2016 00:00:00 | 20/10/2018 00:00:00 | 3 | 787
> 11 | 9800 | 13 | 12/12/2001 00:00:00 | 06/10/2015 00:00:00 | 1 | 5046
> 12 | 9800 | 11 | 06/10/2015 00:00:00 | 06/05/2019 00:00:00 | 2 | 1308
I used a groupby on cust_id and a rolling window to look back over policy_num, looking for 11 as the current policy and 13 as the previous one. I originally thought to create a filter on 350 days, but commented it out because it could break the sequence of policy_num.
import pandas as pd

data = """
| row | cust_id | policy_num | policy_start | policy_end | policy_order | days_on_policy |
| 1 | 1000 | 17 | 09/23/2013 | 11/05/2013 | 1 | 43 |
| 2 | 1200 | 13 | 08/26/2011 | 04/30/2019 | 1 | 2804 |
| 3 | 3400 | 13 | 08/31/2012 | 02/22/2015 | 1 | 905 |
| 4 | 5000 | 17 | 04/12/2014 | 07/28/2014 | 1 | 107 |
| 5 | 5000 | 13 | 07/28/2014 | 08/24/2016 | 2 | 758 |
| 6 | 5000 | 11 | 08/24/2016 | 10/20/2018 | 3 | 787 |
| 7 | 5000 | 13 | 10/20/2018 | 05/02/2019 | 4 | 194 |
| 8 | 7600 | 13 | 02/02/2015 | 05/03/2019 | 1 | 1551 |
| 9 | 4300 | 11 | 01/07/2015 | 05/04/2017 | 1 | 848 |
| 10 | 4300 | 13 | 05/04/2017 | 05/05/2019 | 2 | 731 |
| 11 | 9800 | 13 | 12/12/2001 | 10/06/2015 | 1 | 5046 |
| 12 | 9800 | 11 | 10/06/2015 | 05/06/2019 | 2 | 1308 |
"""
data = data.strip().split('\n')
data = [i.strip().split('|') for i in data]
data = [i[1:-1] for i in data]
columns = [c.strip() for c in data[0]]
df = pd.DataFrame(data[1:], columns=columns)
print(df.columns)
df.set_index(['row'],inplace=True)
# set the datatypes for each column
df['cust_id'] = df['cust_id'].astype(int)
df['policy_num'] = df['policy_num'].astype(int)
df['policy_start'] = pd.to_datetime(df['policy_start'])
df['policy_end'] = pd.to_datetime(df['policy_end'])
df['policy_order'] = df['policy_order'].astype(int)
df['days_on_policy'] = df['days_on_policy'].astype(int)
#print(df)
def create_filter(df, filter_cols, filter_values, operator_values):
    filter_list = []
    for col, val, operator in zip(filter_cols, filter_values, operator_values):
        if operator == '>':
            filter_list.append(df[col] > val)
        elif operator == '>=':
            filter_list.append(df[col] >= val)
        elif operator == '<':
            filter_list.append(df[col] < val)
        elif operator == '<=':
            filter_list.append(df[col] <= val)
        elif operator == '==':
            filter_list.append(df[col] == val)
    return pd.concat(filter_list, axis=1).all(axis=1)
#filter_cols=['days_on_policy']
#filter_values=[350]
#operator_values=['>']
#filter=create_filter(df, filter_cols, filter_values,operator_values)
#df=df[filter]
df = df.sort_values(by=['cust_id','policy_order'], ascending=False)
#print(df)
df_grouped = df.groupby('cust_id')
rolling_df=df_grouped.rolling(window=1).sum()
prev_key, prev_policy_num, prev_days_on_policy = None, "", ""
for key, item in rolling_df.iterrows():
    policy_num = item['policy_num']
    days_on_policy = item['days_on_policy']
    if prev_key is not None:
        prev_policy_num, prev_days_on_policy = rolling_df.loc[prev_key][['policy_num', 'days_on_policy']]
        if key[0] == prev_key[0] and policy_num == 13 and prev_policy_num == 11 and prev_days_on_policy > 350 and days_on_policy > 350:
            print(prev_key[0], prev_policy_num)
    prev_key = key
output:
5000 11.0
9800 11.0

Pandas, create new column based on values from previous rows with certain values

Hi, I'm trying to use ML to predict some future sales, so I would like to add the mean sales from the previous month/year for each product.
My df is something like: [ id | year | month | product_id | sales ]. I would like to add prev_month_mean_sale and prev_month_id_sale columns:
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
---|------|-------|------------|-------|----------------------|-------------------
 1 | 2018 |     1 |        123 |     5 |                  NaN |                NaN
 2 | 2018 |     1 |        234 |     4 |                  NaN |                NaN
 3 | 2018 |     1 |        345 |     2 |                  NaN |                NaN
 4 | 2018 |     2 |        123 |     3 |                  3.6 |                  5
 5 | 2018 |     2 |        345 |     2 |                  3.6 |                  2
 6 | 2018 |     3 |        123 |     4 |                  2.5 |                  3
 7 | 2018 |     3 |        234 |     6 |                  2.5 |                  0
 8 | 2018 |     3 |        567 |     7 |                  2.5 |                  0
 9 | 2019 |     1 |        234 |     4 |                  5.6 |                  6
10 | 2019 |     1 |        567 |     3 |                  5.6 |                  7
Also, I would like to add prev_year_mean_sale and prev_year_id_sale.
prev_month_mean_sale is the mean of the total sales of the previous month, e.g. for month 2 it is (5+4+2)/3.
My actual code is something like:
for index, row in df.iterrows():
    loc = df.index[(df['month'] == row['month'] - 1) &
                   (df['year'] == row['year']) &
                   (df['product_id'] == row['product_id'])].tolist()[0]
    df.loc[index, 'prev_month_id_sale'] = df.loc[loc, 'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid the loop is to use the dataframe's merge():
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales": "prev_month_id_sale"}),
                  how="left",
                  left_on=["year", "prev_month", "product_id"],
                  right_on=["year", "month", "product_id"])
The result will have more columns than you need; you should drop() some of them and/or rename() others.
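For what it's worth, a hedged, self-contained sketch of the same merge idea that also fills prev_month_mean_sale (column names are taken from the question; note that prev_month = month - 1 does not wrap from January back to December of the previous year, which would need a proper period offset):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'year': [2018, 2018, 2018, 2018, 2018],
    'month': [1, 1, 1, 2, 2],
    'product_id': [123, 234, 345, 123, 345],
    'sales': [5, 4, 2, 3, 2],
})
df['prev_month'] = df['month'] - 1

# previous month's sale of the same product: rename the keys so nothing collides
prev_id = df[['year', 'month', 'product_id', 'sales']].rename(
    columns={'month': 'prev_month', 'sales': 'prev_month_id_sale'})
df = df.merge(prev_id, how='left', on=['year', 'prev_month', 'product_id'])

# mean of all sales per (year, month), merged back as the previous month's mean
month_mean = (df.groupby(['year', 'month'])['sales'].mean()
                .rename('prev_month_mean_sale').reset_index()
                .rename(columns={'month': 'prev_month'}))
df = df.merge(month_mean, how='left', on=['year', 'prev_month']).drop(columns='prev_month')
print(df)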

sqlalchemy how to divide 2 columns from different table

I have 2 tables named as company_info and company_income:
company_info :
| id | company_name | staff_num | year |
|----|--------------|-----------|------|
| 0 | A | 10 | 2010 |
| 1 | A | 10 | 2011 |
| 2 | A | 20 | 2012 |
| 3 | B | 20 | 2010 |
| 4 | B | 5 | 2011 |
company_income :
| id | company_name | income | year |
|----|--------------|--------|------|
| 0 | A | 10 | 2010 |
| 1 | A | 20 | 2011 |
| 2 | A | 30 | 2012 |
| 3 | B | 20 | 2010 |
| 4 | B | 15 | 2011 |
Now I want to calculate the average income per staff member for each company; the result should look like this:
result :
| id | company_name | avg_income | year |
|----|--------------|------------|------|
| 0 | A | 1 | 2010 |
| 1 | A | 2 | 2011 |
| 2 | A | 1.5 | 2012 |
| 3 | B | 1 | 2010 |
| 4 | B | 3 | 2011 |
How can I get this result using Python SQLAlchemy? The database behind the tables is MySQL.
Join the tables and do the division. You'd want to either set yourself up a view in MySQL with this query or build it straight in your program.
SELECT
a.company_name,
a.year,
(b.income / a.staff_num) as avg_income
FROM
company_info as a
LEFT JOIN
company_income as b
ON
a.company_name = b.company_name
AND
a.year = b.year
You'd want a few WHEREs as well (such as WHERE staff_num IS NOT NULL and not equal to 0, and the same for income). Also, if you can have multiple rows for the same company/year in either table, you'll want to SUM the values and then GROUP BY company_name and year.
Try this:
SELECT
info.company_name,
(inc.income / info.staff_num) as avg,
info.year
FROM
company_info info JOIN company_income inc
ON
info.company_name = inc.company_name
AND
info.year = inc.year
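Since the question asks for SQLAlchemy specifically, here is a hedged sketch of the same join written with SQLAlchemy Core (1.4+ style). The connection URL is a placeholder and the two tables are reflected from the existing MySQL schema, so nothing below is guaranteed beyond what the question shows:
from sqlalchemy import Float, MetaData, Table, cast, create_engine, select

# placeholder credentials/host/database; needs a MySQL driver such as PyMySQL
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

metadata = MetaData()
info = Table('company_info', metadata, autoload_with=engine)
income = Table('company_income', metadata, autoload_with=engine)

# join on company_name and year, then divide income by staff_num
stmt = (
    select(
        info.c.company_name,
        (cast(income.c.income, Float) / info.c.staff_num).label('avg_income'),
        info.c.year,
    )
    .select_from(
        info.join(
            income,
            (info.c.company_name == income.c.company_name)
            & (info.c.year == income.c.year),
        )
    )
)

with engine.connect() as conn:
    for row in conn.execute(stmt):
        print(row.company_name, row.year, row.avg_income)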

How do I get the change from the same quarter in the previous year in a pandas datatable grouped by more than 1 column

I have a datatable that looks like this (but with more than one country and many more years' worth of data):
| Country | Year | Quarter | Amount |
-------------------------------------------
| UK | 2014 | 1 | 200 |
| UK | 2014 | 2 | 250 |
| UK | 2014 | 3 | 200 |
| UK | 2014 | 4 | 150 |
| UK | 2015 | 1 | 230 |
| UK | 2015 | 2 | 200 |
| UK | 2015 | 3 | 200 |
| UK | 2015 | 4 | 160 |
-------------------------------------------
I want to get the change for each row from the same quarter in the previous year. So for the first 4 rows in the example the change would be null (because there is no previous data for that quarter). For 2015 quarter 1, the difference would be 30 (because quarter 1 for the previous year is 200, so 230 - 200 = 30). So the data table I'm trying to get is:
| Country | Year | Quarter | Amount | Change |
---------------------------------------------------|
| UK | 2014 | 1 | 200 | NaN |
| UK | 2014 | 2 | 250 | NaN |
| UK | 2014 | 3 | 200 | NaN |
| UK | 2014 | 4 | 150 | NaN |
| UK | 2015 | 1 | 230 | 30 |
| UK | 2015 | 2 | 200 | -50 |
| UK | 2015 | 3 | 200 | 0 |
| UK | 2015 | 4 | 160 | 10 |
---------------------------------------------------|
From looking at other questions I've tried using the .diff() method but I'm not quite sure how to get it to do what I want (or if I'll actually need to do something more brute force to work this out), e.g. I've tried:
df.groupby(by=["Country", "Year", "Quarter"]).sum().diff().head(10)
This yields the difference from the previous row in the table as a whole though, rather than the difference from the same quarter for the previous year.
Since you want the change per Country and Quarter across years, and not from the previous row, you have to leave Year out of the group keys.
df['Change'] = df.groupby(['Country', 'Quarter']).Amount.diff()
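A minimal, self-contained check of the idea, using the data from the question (diff works row by row within each (Country, Quarter) group, so the frame should be sorted by Year inside each group, as it is here):
import pandas as pd

df = pd.DataFrame({
    'Country': ['UK'] * 8,
    'Year': [2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
    'Quarter': [1, 2, 3, 4, 1, 2, 3, 4],
    'Amount': [200, 250, 200, 150, 230, 200, 200, 160],
})

# difference from the same quarter of the previous year, per country
df['Change'] = df.groupby(['Country', 'Quarter'])['Amount'].diff()
print(df)
# the 2014 rows come out NaN; the 2015 rows show 30, -50, 0, 10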
