Match and Store Python Dataframe - python

I'm trying to figure out a way to match a given value in a dataframe column to another dataframe column, and then store an AGE from df1 in df2.
e.g. match VAL in df1 to VAL in df2; if the two are equal, store the AGE from df1 in df2's AGE.
| df1 | VAL | AGE |
|:--- |:---:|----:|
| 0 | 20 | 25 |
| 1 | 10 | 29 |
| 2 | 50 | 21 |
| 4 | 20 | 32 |
| 5 | 00 | 19 |
| df2 | VAL | AGE |
|:--- |:---:|----:|
| 0 | 00 | [] |
| 1 | 10 | [] |
| 2 | 20 | [] |
| 4 | 30 | [] |
| 5 | 40 | [] |
| 6 | 50 | [] |
Edit: AGE in df2 stores an array of values rather than a single value.

Try:
x = df1.groupby("VAL").agg(list)  # collect all AGEs per VAL as a list
df2["AGE"] = df2["VAL"].map(x["AGE"]).fillna({i: [] for i in df2.index})  # unmatched VALs get []
print(df2)
Prints:
VAL AGE
0 0 [19]
1 10 [29]
2 20 [25, 32]
4 30 []
5 40 []
6 50 [21]
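For reference, a minimal self-contained sketch of the above (frame contents taken from the tables, with contiguous indices and integer VAL, so "00" becomes 0):
import pandas as pd

df1 = pd.DataFrame({"VAL": [20, 10, 50, 20, 0], "AGE": [25, 29, 21, 32, 19]})
df2 = pd.DataFrame({"VAL": [0, 10, 20, 30, 40, 50]})

x = df1.groupby("VAL").agg(list)
df2["AGE"] = df2["VAL"].map(x["AGE"]).fillna({i: [] for i in df2.index})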

Related

Fetch values corresponding to id of each row python

Is it possible to fetch a column containing values corresponding to an id column?
Example:-
df1
| ID | Value | Salary |
|:--:|:-----:|:------:|
| 1 | amr | 34 |
| 1 | ith | 67 |
| 2 | oaa | 45 |
| 1 | eea | 78 |
| 3 | anik | 56 |
| 4 | mmkk | 99 |
| 5 | sh_s | 98 |
| 5 | ahhi | 77 |
df2
| ID | Dept |
|:--:|:----:|
| 1 | hrs |
| 1 | cse |
| 2 | me |
| 1 | ece |
| 3 | eee |
Expected Output
| ID | Dept | Value |
|:--:|:----:|:-----:|
| 1 | hrs | amr |
| 1 | cse | ith |
| 2 | me | oaa |
| 1 | ece | eea |
| 3 | eee | anik |
I want to fetch each value in the 'Value' column corresponding to the values in df2's ID column, and create a column containing those values in df2. The number of rows in the two dfs is not the same. I have tried this, but it did not work.
IIUC, you can use df.merge after assigning a helper column via groupby + cumcount on ID:
out = (df1.assign(k=df1.groupby("ID").cumcount())
          .merge(df2.assign(k=df2.groupby("ID").cumcount()), on=['ID', 'k'])
          .drop(columns="k"))
print(out)
ID Value Dept
0 1 amr hrs
1 1 ith cse
2 2 oaa me
3 1 eea ece
4 3 anik eee
Is this what you want to do?
df1.merge(df2, how='inner', on='ID')
Since you have duplicated IDs in both dfs, but these are ordered, try:
df1 = df1.drop(columns="ID")
df3 = df2.merge(df1, left_index=True, right_index=True)
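Note that this last approach pairs rows purely by position, so it assumes both frames share the same (default) index; a sketch of making that assumption explicit:
df3 = (df2.reset_index(drop=True)
          .merge(df1.reset_index(drop=True), left_index=True, right_index=True))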

Python - Dataframe group by time

There is a DataFrame with this SAMPLE (not the original data) of records:
import pandas as pd
# sample records matching the table below
dikt = [[34, 12, 3], [34, 6, 5], [56, 23, 8], [56, 21, 9], [56, 67, 22]]
df = pd.DataFrame(dikt, columns=['id', 'price', 'day'])
df:
+-------+-----+-------+-----+
| index | id | price | day |
+-------+-----+-------+-----+
| 0 | 34 | 12 | 3 |
+-------+-----+-------+-----+
| 1 | 34 | 6 | 5 |
+-------+-----+-------+-----+
| 2 | 56 | 23 | 8 |
+-------+-----+-------+-----+
| 3 | 56 | 21 | 9 |
+-------+-----+-------+-----+
| 4 | 56 | 67 | 22 |
+-------+-----+-------+-----+
| ... | ... | ... | |
+-------+-----+-------+-----+
I want to group the prices by week, like this:
+-------+-----+---------------------+
| index | id | price |
+-------+-----+---------------------+
| 0 | 34 | [12, 6] |
+-------+-----+---------------------+
| 1 | 56 | [23, 21], [67] |
+-------+-----+---------------------+
| ... | ... | ... |
+-------+-----+---------------------+
In the above table, the prices are grouped by day. For example, 12 and 6 fall on days 3 and 5, which are both in the first week, so they are grouped together, and so on.
Divide the day by 7 to get a week number, add it as a column, and group by it; then combine the grouped lists per id, without the week number:
df['weeknum'] = df['day'] // 7                                      # integer week number
df2 = df.groupby(['id', 'weeknum'])['price'].agg(list).to_frame()   # prices per (id, week)
df2['price'] = df2['price'].astype(str)                             # stringify each week's list
df2.groupby('id')['price'].agg(','.join).to_frame()                 # join the weeks per id
price
id
34 [12, 6]
56 [23, 21],[67]
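If you would rather keep the weekly prices as nested lists instead of a joined string, a hedged alternative along the same lines (out is an assumed name):
out = (df.groupby(['id', df['day'] // 7])['price']
         .agg(list)        # prices per (id, week)
         .groupby('id')
         .agg(list)        # collect each id's weekly lists
         .to_frame('price'))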

Find top N values within each group

I have a dataset similar to the sample below:
| id | size | old_a | old_b | new_a | new_b |
|----|--------|-------|-------|-------|-------|
| 6 | small | 3 | 0 | 21 | 0 |
| 6 | small | 9 | 0 | 23 | 0 |
| 13 | medium | 3 | 0 | 12 | 0 |
| 13 | medium | 37 | 0 | 20 | 1 |
| 20 | medium | 30 | 0 | 5 | 6 |
| 20 | medium | 12 | 2 | 3 | 0 |
| 12 | small | 7 | 0 | 2 | 0 |
| 10 | small | 8 | 0 | 12 | 0 |
| 15 | small | 19 | 0 | 3 | 0 |
| 15 | small | 54 | 0 | 8 | 0 |
| 87 | medium | 6 | 0 | 9 | 0 |
| 90 | medium | 11 | 1 | 16 | 0 |
| 90 | medium | 25 | 0 | 4 | 0 |
| 90 | medium | 10 | 0 | 5 | 0 |
| 9 | large | 8 | 1 | 23 | 0 |
| 9 | large | 19 | 0 | 2 | 0 |
| 1 | large | 1 | 0 | 0 | 0 |
| 50 | large | 34 | 0 | 7 | 0 |
This is the input for the above table:
import pandas as pd
data = [[6,'small',3,0,21,0],[6,'small',9,0,23,0],[13,'medium',3,0,12,0],[13,'medium',37,0,20,1],[20,'medium',30,0,5,6],[20,'medium',12,2,3,0],[12,'small',7,0,2,0],[10,'small',8,0,12,0],[15,'small',19,0,3,0],[15,'small',54,0,8,0],[87,'medium',6,0,9,0],[90,'medium',11,1,16,0],[90,'medium',25,0,4,0],[90,'medium',10,0,5,0],[9,'large',8,1,23,0],[9,'large',19,0,2,0],[1,'large',1,0,0,0],[50,'large',34,0,7,0]]
data = pd.DataFrame(data, columns=['id','size','old_a','old_b','new_a','new_b'])
I want an output that groups the dataset on size and lists the top 2 ids based on the values of the 'new_a' column within each size group. Since some of the ids repeat, I want to sum the values of new_a for such ids and then find the top 2 values. My final table should look like the one below:
| size | id | new_a |
|--------|----|-------|
| large | 9 | 25 |
| large | 50 | 7 |
| medium | 13 | 32 |
| medium | 90 | 25 |
| small | 6 | 44 |
| small | 10 | 12 |
I have tried the below code, but it isn't showing the top 2 values of new_a for each group within the 'size' column.
nlargest = data.groupby(['size','id'])['new_a'].sum().nlargest(2).reset_index()
print(
    data.groupby('size').apply(
        lambda x: x.groupby('id')[['new_a']].sum().nlargest(2, columns='new_a')
    ).reset_index()[['size', 'id', 'new_a']]
)
Prints:
size id new_a
0 large 9 25
1 large 50 7
2 medium 13 32
3 medium 90 25
4 small 6 44
5 small 10 12
You can set size, id as the index to avoid a double groupby here and sum within the id level (the sum(level=...) shortcut only exists in older pandas; groupby(level=1).sum() is the portable spelling):
data.set_index(["size", "id"])["new_a"].groupby(level=0).apply(
    lambda x: x.groupby(level=1).sum().nlargest(2)
).reset_index()
size id new_a
0 large 9 25
1 large 50 7
2 medium 13 32
3 medium 90 25
4 small 6 44
5 small 10 12
You can chain two groupby methods:
data.groupby(['id', 'size'])['new_a'].sum().groupby('size').nlargest(2)\
.droplevel(0).to_frame('new_a').reset_index()
Output:
id size new_a
0 9 large 25
1 50 large 7
2 13 medium 32
3 90 medium 25
4 6 small 44
5 10 small 12
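For completeness, a sketch of an apply-free variant of the same idea: sum first, then sort and take the first two rows per size (ties broken arbitrarily):
totals = data.groupby(['size', 'id'], as_index=False)['new_a'].sum()
top2 = (totals.sort_values(['size', 'new_a'], ascending=[True, False])
              .groupby('size')
              .head(2))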

Filter all rows from groupby object

I have a dataframe like below
+-----------+------------+---------------+------+-----+-------+
| InvoiceNo | CategoryNo | Invoice Value | Item | Qty | Price |
+-----------+------------+---------------+------+-----+-------+
| 1 | 1 | 77 | 128 | 1 | 10 |
| 1 | 1 | 77 | 101 | 1 | 11 |
| 1 | 2 | 77 | 105 | 3 | 12 |
| 1 | 3 | 77 | 129 | 2 | 10 |
| 2 | 1 | 21 | 145 | 1 | 9 |
| 2 | 2 | 21 | 130 | 1 | 12 |
+-----------+------------+---------------+------+-----+-------+
I want to keep an entire group if any of the items in the list item_list = [128, 129, 130] is present in that group, after grouping by 'InvoiceNo' and 'CategoryNo'.
My desired output is below:
+-----------+------------+---------------+------+-----+-------+
| InvoiceNo | CategoryNo | Invoice Value | Item | Qty | Price |
+-----------+------------+---------------+------+-----+-------+
| 1 | 1 | 77 | 128 | 1 | 10 |
| 1 | 1 | 77 | 101 | 1 | 11 |
| 1 | 3 | 77 | 129 | 2 | 10 |
| 2 | 2 | 21 | 130 | 1 | 12 |
+-----------+------------+---------------+------+-----+-------+
I know how to filter a dataframe using isin(), but I'm not sure how to do it with groupby().
So far I have tried the below:
import pandas as pd
df = pd.read_csv('data.csv')
item_list = [128, 129, 130]
df.groupby(['InvoiceNo', 'CategoryNo'])['Item'].isin(item_list)
but it does not work. Please guide me on how to solve this issue.
You can do something like this:
s = (df['Item'].isin(item_list)                        # row-level membership flags
       .groupby([df['InvoiceNo'], df['CategoryNo']])   # regroup the flags
       .transform('any')                               # True for every row of a matching group
)
df[s]
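An equivalent, usually slower, spelling keeps whole groups with groupby().filter:
df.groupby(['InvoiceNo', 'CategoryNo']).filter(
    lambda g: g['Item'].isin(item_list).any()
)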

Filtering out rows that don't meet specific order and value criteria in Python or tSQL?

I need some help filtering rows out of a customer dataset I've created.
The dataset contains customer IDs, policy numbers, and the dates related to their policies. Customers can switch freely between policies, anytime they wish. The following dataset is all just an example dataset I put together. I can use either pandas or sql server to filter out the right customers.
Objective:
I want to filter the dataset and retrieve customers under the following conditions:
Customers must have chronologically been on policy rate 13, then switched to 11.
Customers must have at least 350 days on both policies.
I've included a column (policy_order) showing the order of active policies. It doesn't matter when the 13 => 11 switch occurred, as long as the jump was from 13 to 11 and they spent 350 days on each.
| row | cust_id | policy_num | policy_start | policy_end | policy_order | days_on_policy |
|-----|---------|------------|--------------|------------|--------------|----------------|
| 1 | 1000 | 17 | 09/23/2013 | 11/05/2013 | 1 | 43 |
| 2 | 1200 | 13 | 08/26/2011 | 04/30/2019 | 1 | 2804 |
| 3 | 3400 | 13 | 08/31/2012 | 02/22/2015 | 1 | 905 |
| 4 | 5000 | 17 | 04/12/2014 | 07/28/2014 | 1 | 107 |
| 5 | 5000 | 13 | 07/28/2014 | 08/24/2016 | 2 | 758 |
| 6 | 5000 | 11 | 08/24/2016 | 10/20/2018 | 3 | 787 |
| 7 | 5000 | 13 | 10/20/2018 | 05/02/2019 | 4 | 194 |
| 8 | 7600 | 13 | 02/02/2015 | 05/03/2019 | 1 | 1551 |
| 9 | 4300 | 11 | 01/07/2015 | 05/04/2017 | 1 | 848 |
| 10 | 4300 | 13 | 05/04/2017 | 05/05/2019 | 2 | 731 |
| 11 | 9800 | 13 | 12/12/2001 | 10/06/2015 | 1 | 5046 |
| 12 | 9800 | 11 | 10/06/2015 | 05/06/2019 | 2 | 1308 |
As seen in the table above, two customers match the criteria: customer 5000 and customer 9800. I used customer 5000 as an example because they've switched policies multiple times but still meet the criteria in rows 5 and 6. Those are the only rows I'm concerned with.
So the output that I would want to see would look like this:
| row | acct | policy_num | policy_start | policy_end | policy_order | days_on_policy |
|-----|------|------------|--------------|------------|--------------|----------------|
| 1 | 5000 | 13 | 7/28/2014 | 8/24/2016 | 2 | 758 |
| 2 | 5000 | 11 | 8/24/2016 | 10/20/2018 | 3 | 787 |
| 3 | 9800 | 13 | 12/12/2001 | 10/6/2015 | 1 | 5046 |
| 4 | 9800 | 11 | 10/6/2015 | 5/6/2019 | 2 | 1308 |
The results would show the customer ID, the correct policy numbers, relevant dates, and how many days they were on each policy.
I've tried filtering using the WHERE clause in SQL (which I'm admittedly bad at), but haven't come close to an answer and don't really know where to start.
My main goal is to try and get the rows filtered using order, policy number, and days on policy.
Any and all help is greatly appreciated!
If you want a solution based on Pandas, then define the following
filtering function:
def fltr(gr):
    # rows of the current group for policy 11 or 13, ordered by policy_order
    wrk = gr.query('policy_num in [11, 13]').sort_values('policy_order')
    pNum = wrk.set_index('policy_order').policy_num
    # reject groups that lack either policy (note: `not`, not bitwise `~`)
    if not ((pNum == 11).any() and (pNum == 13).any()):
        return None
    ind11 = pNum[pNum == 11].index[0]
    ind13 = pNum[pNum == 13].index[0]
    if ind13 > ind11:  # the first 13 must precede the first 11
        return None
    if (wrk.groupby('policy_num').days_on_policy.sum() >= 350).all():
        return wrk.drop_duplicates(subset='policy_num')
    return None
Then use it in groupby:
df.groupby('cust_id').apply(fltr)
A short description of the filtering function:
It starts by computing auxiliary variables:
wrk - rows of the current group with policy_num equal to either 11 or 13, ordered by policy_order.
pNum - the policy_num column from wrk, indexed by policy_order.
The filtering function has two "initial" occasions to return None, rejecting the current group:
pNum fails to contain at least one 11 and at least one 13.
The index (actually policy_order) of the first 13 element in pNum is greater than the index of the first 11 element (i.e. policy 13 follows policy 11).
The last decision is based on the question: does each of the policies in question (11 and 13) have a days_on_policy sum >= 350? If yes, the function returns the rows from wrk without repetitions, dropping a possible trailing 13 (as in the case of group 5000). Otherwise, the current group is also rejected.
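Since fltr returns a frame for each accepted group, the groupby().apply call above comes back with cust_id prepended to the index; a small sketch of flattening it (cust_id is still present as a column in the returned rows):
out = df.groupby('cust_id').apply(fltr).reset_index(drop=True)
print(out)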
Here is what I guess you would need.
SELECT *
FROM policy p1
WHERE policy_num = 13
  AND days_on_policy >= 350
  AND EXISTS
      (SELECT 1 FROM policy p2
       WHERE p1.cust_id = p2.cust_id
         AND p2.policy_num = 11
         AND p2.policy_start >= p1.policy_end
         AND p2.days_on_policy >= 350)
UNION ALL
SELECT *
FROM policy p1
WHERE policy_num = 11
  AND days_on_policy >= 350
  AND EXISTS
      (SELECT 1 FROM policy p2
       WHERE p1.cust_id = p2.cust_id
         AND p2.policy_num = 13
         AND p1.policy_start >= p2.policy_end
         AND p2.days_on_policy >= 350)
(demo on SQL Fiddle)
It is nearly always better to do the filtering of data within the query, unless the performance of the database is affected by the query.
If your dataset isn't too large, this is the procedure I would use to filter:
# filter on the criteria for the policy number
df_13_fltr = df[(df['policy_num'] == 13) &
                (df['days_on_policy'] >= 350)][['row', 'cust_id', 'policy_end']]
df_11_fltr = df[(df['policy_num'] == 11) &
                (df['days_on_policy'] >= 350)][['row', 'cust_id', 'policy_start']]
# merge the two filtered DataFrames and compare policy_end with policy_start
# (the suffixes name the frame each 'row' column came from: the left frame holds the 11s)
df_fltr = df_11_fltr.merge(df_13_fltr, on='cust_id', how='inner', suffixes=('11', '13'))
df_fltr = df_fltr[df_fltr['policy_end'] <= df_fltr['policy_start']][['row13', 'row11']]
# put the rows in a list
rows = list(df_fltr['row13']) + list(df_fltr['row11'])
# use the rows list to filter the original dataset
df[df['row'].isin(rows)]
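With the sample data, only the customer 5000 and 9800 pairs survive the date comparison (4300 fails it, since its 13 ends after its 11 starts), so rows comes out as [5, 11, 6, 12] and the final filter returns rows 5, 6, 11 and 12, matching the desired output.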
With a self join and the conditions applied in the ON clause:
select t1.*
from tablename t1 inner join tablename t2
on
t2.cust_id = t1.cust_id
and (
(t2.policy_start = t1.policy_end) and (t1.policy_num = 13 and t2.policy_num = 11)
or
(t1.policy_start = t2.policy_end) and (t2.policy_num = 13 and t1.policy_num = 11)
)
and t1.days_on_policy >= 350 and t2.days_on_policy >= 350
order by t1.cust_id, t1.policy_start
See the demo.
Results:
| row | cust_id | policy_num | policy_start        | policy_end          | policy_order | days_on_policy |
|----:|--------:|-----------:|:--------------------|:--------------------|-------------:|---------------:|
| 5   | 5000    | 13         | 28/07/2014 00:00:00 | 24/08/2016 00:00:00 | 2            | 758            |
| 6   | 5000    | 11         | 24/08/2016 00:00:00 | 20/10/2018 00:00:00 | 3            | 787            |
| 11  | 9800    | 13         | 12/12/2001 00:00:00 | 06/10/2015 00:00:00 | 1            | 5046           |
| 12  | 9800    | 11         | 06/10/2015 00:00:00 | 06/05/2019 00:00:00 | 2            | 1308           |
I used a groupby on cust_id and a rolling window to look back at the policy_num, finding 11 current and 13 previous. I originally thought to create a filter on 350 days, but commented it out because it could break the sequence of policy_num.
data = """
| row | cust_id | policy_num | policy_start | policy_end | policy_order | days_on_policy |
| 1 | 1000 | 17 | 09/23/2013 | 11/05/2013 | 1 | 43 |
| 2 | 1200 | 13 | 08/26/2011 | 04/30/2019 | 1 | 2804 |
| 3 | 3400 | 13 | 08/31/2012 | 02/22/2015 | 1 | 905 |
| 4 | 5000 | 17 | 04/12/2014 | 07/28/2014 | 1 | 107 |
| 5 | 5000 | 13 | 07/28/2014 | 08/24/2016 | 2 | 758 |
| 6 | 5000 | 11 | 08/24/2016 | 10/20/2018 | 3 | 787 |
| 7 | 5000 | 13 | 10/20/2018 | 05/02/2019 | 4 | 194 |
| 8 | 7600 | 13 | 02/02/2015 | 05/03/2019 | 1 | 1551 |
| 9 | 4300 | 11 | 01/07/2015 | 05/04/2017 | 1 | 848 |
| 10 | 4300 | 13 | 05/04/2017 | 05/05/2019 | 2 | 731 |
| 11 | 9800 | 13 | 12/12/2001 | 10/06/2015 | 1 | 5046 |
| 12 | 9800 | 11 | 10/06/2015 | 05/06/2019 | 2 | 1308 |
"""
data = data.strip().split('\n')
data = [i.strip().split('|') for i in data]
data = [i[1:-1] for i in data]
columns = [c.strip() for c in data[0]]
df = pd.DataFrame(data[1:], columns=columns)
print(df.columns)
df.set_index(['row'],inplace=True)
# set the datatypes for each column
df['cust_id'] = df['cust_id'].astype(int)
df['policy_num'] = df['policy_num'].astype(int)
df['policy_start'] = pd.to_datetime(df['policy_start'])
df['policy_end'] = pd.to_datetime(df['policy_end'])
df['policy_order'] = df['policy_order'].astype(int)
df['days_on_policy'] = df['days_on_policy'].astype(int)
#print(df)
def create_filter(df, filter_cols, filter_values, operator_values):
    # build one boolean Series per (column, value, operator) triple
    filter_list = []
    for col, val, operator in zip(filter_cols, filter_values, operator_values):
        if operator == '>':
            filter_list.append(df[col] > val)
        elif operator == '>=':
            filter_list.append(df[col] >= val)
        elif operator == '<':
            filter_list.append(df[col] < val)
        elif operator == '<=':
            filter_list.append(df[col] <= val)
        elif operator == '==':
            filter_list.append(df[col] == val)
    # AND all the conditions together row-wise
    return pd.concat(filter_list, axis=1).all(axis=1)
#filter_cols=['days_on_policy']
#filter_values=[350]
#operator_values=['>']
#filter=create_filter(df, filter_cols, filter_values,operator_values)
#df=df[filter]
df = df.sort_values(by=['cust_id', 'policy_order'], ascending=False)
#print(df)
df_grouped = df.groupby('cust_id')
rolling_df = df_grouped.rolling(window=1).sum()
prev_key = None
for key, item in rolling_df.iterrows():
    policy_num = item['policy_num']
    days_on_policy = item['days_on_policy']
    if prev_key is not None:
        prev_policy_num, prev_days_on_policy = rolling_df.loc[prev_key][['policy_num', 'days_on_policy']]
        # same customer, current 13 followed (in the descending sort) by 11, both over 350 days
        if key[0] == prev_key[0] and policy_num == 13 and prev_policy_num == 11 \
                and prev_days_on_policy > 350 and days_on_policy > 350:
            print(prev_key[0], prev_policy_num)
    prev_key = key
output:
5000 11.0
9800 11.0
