Pandas melt multiple groups into single column - python

Original DataFrame:
+----+----------+----------+----------+----------+
| ID | var1hrs | var2hrs | ind1var | ind2var |
+----+----------+----------+----------+----------+
| 1 | 55 | 45 | 123 | 456 |
| 2 | 48 | 60 | 331 | 222 |
+----+----------+----------+----------+----------+
Target DataFrame:
+----+------------+------+------+
| ID | type | hrs | ind |
+----+------------+------+------+
| 1 | primary | 55 | 123 |
| 1 | secondary | 45 | 456 |
| 2 | primary | 48 | 331 |
| 2 | secondary | 60 | 222 |
+----+------------+------+------+
How would I go about melting multiple groups of variables into a single label column? The "1" in the variable names indicates type = "primary" and "2" indicates type = "secondary".

After modifying the column names, we can use wide_to_long:
# truncate the column names so the stub names become 'var' and 'ind'
df.columns = df.columns.str[:4]
s = pd.wide_to_long(df, ['var', 'ind'], i='ID', j='type').reset_index()
# wide_to_long casts purely numeric suffixes to integers, so map with int keys
s = s.assign(type=s.type.map({1: 'primary', 2: 'secondary'})).sort_values('ID')
s
ID type var ind
0 1 primary 55 123
2 1 secondary 45 456
1 2 primary 48 331
3 2 secondary 60 222

(Comments inlined)
# set ID as the index and sort columns
df = df.set_index('ID').sort_index(axis=1)
# extract primary columns
prim = df.filter(like='1')
prim.columns = ['ind', 'vars']
# extract secondary columns
sec = df.filter(like='2')
sec.columns = ['ind', 'vars']
# concatenation + housekeeping
v = (pd.concat([prim, sec], keys=['primary', 'secondary'])
.swaplevel(0, 1)
.rename_axis(['ID', 'type'])
.reset_index()
)
print(v)
ID type ind vars
0 1 primary 123 55
1 2 primary 331 48
2 1 secondary 456 45
3 2 secondary 222 60
This is more or less one efficient way of doing it, even if the steps are a bit involved.
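For completeness, here is a rough alternative sketch (not from either answer above) that splits each column name into a measure part and a type digit and then stacks the digit level; the DataFrame literal simply reproduces the question's data:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2],
                   'var1hrs': [55, 48], 'var2hrs': [45, 60],
                   'ind1var': [123, 331], 'ind2var': [456, 222]})

tmp = df.set_index('ID')
# 'var1hrs' -> ('var', '1'), 'ind2var' -> ('ind', '2'), ...
tmp.columns = pd.MultiIndex.from_tuples(
    [(c[:3], c[3]) for c in tmp.columns], names=[None, 'type'])

out = (tmp.stack(level='type')             # move the type digit into the rows
          .rename(columns={'var': 'hrs'})  # the 'var' columns hold the hours values
          .rename(index={'1': 'primary', '2': 'secondary'}, level='type')
          .reset_index())
print(out[['ID', 'type', 'hrs', 'ind']])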

Related

How to create new columns with the names of the columns in a list with the highest value per ID, mentioned after a comma if needed, in Python Pandas?

I have a Pandas DataFrame like the one below (my actual DataFrame is definitely bigger, so I need to do the aggregation below only for selected columns):
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111 | 10 | 10 | 320 | 120
222 | 15 | 80 | 500 | 500
333 | 0 | 0 | 110 | 350
444 | 20 | 5 | 0 | 0
555 | 0 | 0 | 0 | 0
666 | 10 | 20 | 60 | 50
Requirements:
I need to create a new column "TOP_COUNT_2" holding the name of the column (COUNT_COL_A or COUNT_COL_B) with the highest value for each ID;
if some ID has the same value in all "COUNT_" columns, put all column names with the prefix "COUNT_" into "TOP_COUNT_2", separated by commas.
I need to create a new column "TOP_SUM_2" holding the name of the column (SUM_COL_A or SUM_COL_B) with the highest value for each ID;
if some ID has the same value in all "SUM_" columns, put all column names with the prefix "SUM_" into "TOP_SUM_2", separated by commas.
If there is 0 in both columns with the prefix COUNT_, give NaN in TOP_COUNT_2.
If there is 0 in both columns with the prefix SUM_, give NaN in TOP_SUM_2.
Desired output:
ID  | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B | TOP_COUNT_2              | TOP_SUM_2
----|-------------|-------------|-----------|-----------|--------------------------|----------------------
111 | 10          | 10          | 320       | 120       | COUNT_COL_A, COUNT_COL_B | SUM_COL_A
222 | 15          | 80          | 500       | 500       | COUNT_COL_B              | SUM_COL_A, SUM_COL_B
333 | 0           | 0           | 110       | 350       | NaN                      | SUM_COL_B
444 | 20          | 5           | 0         | 0         | COUNT_COL_A              | NaN
555 | 0           | 0           | 0         | 0         | NaN                      | NaN
666 | 10          | 20          | 60        | 50        | COUNT_COL_B              | SUM_COL_A
How can I do that in Python Pandas?
First create masks with Series.ne and DataFrame.any to keep only the rows that are not all zero, select them with boolean indexing via DataFrame.loc, then compare each row against its maximum and join the matching column names with the DataFrame.dot trick, using the column names plus a separator:
cols1 = ['COUNT_COL_A' , 'COUNT_COL_B']
cols2 = ['SUM_COL_A','SUM_COL_B']
m1 = df[cols1].ne(0).any(axis=1)
m2 = df[cols2].ne(0).any(axis=1)
df1 = df.loc[m1, cols1]
df2 = df.loc[m2, cols2]
df['TOP_COUNT_2'] = df1.eq(df1.max(axis=1), axis=0).dot(df1.columns + ', ').str[:-2]
df['TOP_SUM_2'] = df2.eq(df2.max(axis=1), axis=0).dot(df2.columns + ', ').str[:-2]
print (df)
ID COUNT_COL_A COUNT_COL_B SUM_COL_A SUM_COL_B \
0 111 10 10 320 120
1 222 15 80 500 500
2 333 0 0 110 350
3 444 20 5 0 0
4 555 0 0 0 0
5 666 10 20 60 50
TOP_COUNT_2 TOP_SUM_2
0 COUNT_COL_A, COUNT_COL_B SUM_COL_A
1 COUNT_COL_B SUM_COL_A, SUM_COL_B
2 NaN SUM_COL_B
3 COUNT_COL_A NaN
4 NaN NaN
5 COUNT_COL_B SUM_COL_A
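To see why the DataFrame.dot trick joins the column names, here is a tiny standalone sketch with made-up values: comparing each row to its row-wise maximum gives a boolean frame, and the dot product with the string column labels concatenates the labels of the True cells.
import pandas as pd

d = pd.DataFrame({'COUNT_COL_A': [10, 15], 'COUNT_COL_B': [10, 80]})
mask = d.eq(d.max(axis=1), axis=0)          # True where the value equals the row maximum
print(mask.dot(d.columns + ', ').str[:-2])  # concatenate matching labels, strip trailing ', '
# 0    COUNT_COL_A, COUNT_COL_B
# 1                 COUNT_COL_B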

Match and Store Python Dataframe

I'm trying to figure out a way to match a given value in a dataframe column to another dataframe column, and then store an AGE from df1 in df2.
e.g. matching VAL in df1 to VAL in df2: if the two are equal, store AGE from df1 in df2's AGE.
| df1 | VAL | AGE |
|:--- |:---:|----:|
| 0 | 20 | 25 |
| 1 | 10 | 29 |
| 2 | 50 | 21 |
| 4 | 20 | 32 |
| 5 | 00 | 19 |
| df2 | VAL | AGE |
|:--- |:---:|----:|
| 0 | 00 | [] |
| 1 | 10 | [] |
| 2 | 20 | [] |
| 4 | 30 | [] |
| 5 | 40 | [] |
| 6 | 50 | [] |
edit: AGE in df2 stores an array of values rather than a single value
Try:
x = df1.groupby("VAL").agg(list)
df2["AGE"] = df2["VAL"].map(x["AGE"]).fillna({i: [] for i in df2.index})
print(df2)
Prints:
VAL AGE
0 0 [19]
1 10 [29]
2 20 [25, 32]
4 30 []
5 40 []
6 50 [21]
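The same idea can also be spelled out step by step; this is a minimal sketch with the question's values, replacing the fillna-with-a-dict trick by an explicit apply:
import pandas as pd

df1 = pd.DataFrame({"VAL": [20, 10, 50, 20, 0], "AGE": [25, 29, 21, 32, 19]})
df2 = pd.DataFrame({"VAL": [0, 10, 20, 30, 40, 50]})

ages = df1.groupby("VAL")["AGE"].agg(list)   # VAL -> list of ages
df2["AGE"] = df2["VAL"].map(ages)            # unmatched VALs become NaN
df2["AGE"] = df2["AGE"].apply(lambda v: v if isinstance(v, list) else [])
print(df2)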

Fetch values corresponding to id of each row python

Is it possible to fetch a column containing values corresponding to an ID column?
Example:
df1
| ID | Value | Salary |
|:---|:------|:-------|
| 1 | amr | 34 |
| 1 | ith | 67 |
| 2 | oaa | 45 |
| 1 | eea | 78 |
| 3 | anik | 56 |
| 4 | mmkk | 99 |
| 5 | sh_s | 98 |
| 5 | ahhi | 77 |
df2
| ID | Dept |
|:---|:-----|
| 1 | hrs |
| 1 | cse |
| 2 | me |
| 1 | ece |
| 3 | eee |
Expected Output
| ID | Dept | Value |
|:---|:-----|:------|
| 1 | hrs | amr |
| 1 | cse | ith |
| 2 | me | oaa |
| 1 | ece | eea |
| 3 | eee | anik |
I want to fetch each value in the 'Value' column corresponding to the values in df2's ID column, and create a column containing those values in df2. The number of rows in the two dfs is not the same. I have tried this, but it did not work.
IIUC, you can try df.merge after assigning a helper column created with groupby + cumcount on ID:
out = (df1.assign(k=df1.groupby("ID").cumcount())   # occurrence number within each ID
          .merge(df2.assign(k=df2.groupby("ID").cumcount()), on=['ID', 'k'])
          .drop(columns="k"))                       # drop the helper column
print(out)
   ID Value Dept
0   1   amr  hrs
1   1   ith  cse
2   2   oaa   me
3   1   eea  ece
4   3  anik  eee
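As an aside on how the helper works, here is a tiny illustration of groupby.cumcount on df2 (values from the question): every repeated ID gets an occurrence number, so the merge pairs the n-th row of an ID in df1 with the n-th row of the same ID in df2.
import pandas as pd

df2 = pd.DataFrame({"ID": [1, 1, 2, 1, 3], "Dept": ["hrs", "cse", "me", "ece", "eee"]})
print(df2.assign(k=df2.groupby("ID").cumcount()))
#    ID Dept  k
# 0   1  hrs  0
# 1   1  cse  1
# 2   2   me  0
# 3   1  ece  2
# 4   3  eee  0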
Is this what you want to do?
df1.merge(df2, how='inner', on='ID')
Since you have duplicated IDs in both dfs, but these are ordered, try:
df1 = df1.drop(columns="ID")
df3 = df2.merge(df1, left_index=True, right_index=True)

Filtering out rows that don't meet specific order and value criteria in Python or tSQL?

I need some help filtering rows out of a customer dataset I've created.
The dataset contains customer IDs, policy numbers, and the dates related to their policies. Customers can switch freely between policies, anytime they wish. The following dataset is all just an example dataset I put together. I can use either pandas or sql server to filter out the right customers.
Objective:
I want to filter the dataset and retrieve customers under the following conditions:
Customer must have chronologically been on Policy rate 13, then switched to 11.
Customers must have at least 350 days on both policies.
I've included a column (policy_order) showing the order of active policies. It doesn't matter when the 13 => 11 switch occurred, as long as the jump was from 13 to 11 and they spent 350 days on each.
| row | cust_id | policy_num | policy_start | policy_end | policy_order | days_on_policy |
|-----|---------|------------|--------------|------------|--------------|----------------|
| 1 | 1000 | 17 | 09/23/2013 | 11/05/2013 | 1 | 43 |
| 2 | 1200 | 13 | 08/26/2011 | 04/30/2019 | 1 | 2804 |
| 3 | 3400 | 13 | 08/31/2012 | 02/22/2015 | 1 | 905 |
| 4 | 5000 | 17 | 04/12/2014 | 07/28/2014 | 1 | 107 |
| 5 | 5000 | 13 | 07/28/2014 | 08/24/2016 | 2 | 758 |
| 6 | 5000 | 11 | 08/24/2016 | 10/20/2018 | 3 | 787 |
| 7 | 5000 | 13 | 10/20/2018 | 05/02/2019 | 4 | 194 |
| 8 | 7600 | 13 | 02/02/2015 | 05/03/2019 | 1 | 1551 |
| 9 | 4300 | 11 | 01/07/2015 | 05/04/2017 | 1 | 848 |
| 10 | 4300 | 13 | 05/04/2017 | 05/05/2019 | 2 | 731 |
| 11 | 9800 | 13 | 12/12/2001 | 10/06/2015 | 1 | 5046 |
| 12 | 9800 | 11 | 10/06/2015 | 05/06/2019 | 2 | 1308 |
As seen in the table above, two customers match the criteria: customer 5000 and customer 9800. I used customer 5000 as an example because they've switched policies multiple times but still meet the criteria in rows 5 and 6. These are the only rows I'm concerned with.
So the output that I would want to see would look like this:
| row | acct | policy_num | policy_start | policy_end | policy_order | days_on_policy |
|-----|------|------------|--------------|------------|--------------|----------------|
| 1 | 5000 | 13 | 7/28/2014 | 8/24/2016 | 2 | 758 |
| 2 | 5000 | 11 | 8/24/2016 | 10/20/2018 | 3 | 787 |
| 3 | 9800 | 13 | 12/12/2001 | 10/6/2015 | 1 | 5046 |
| 4 | 9800 | 11 | 10/6/2015 | 5/6/2019 | 2 | 1308 |
The results would show the customer ID, the correct policy numbers, relevant dates, and how many days they were on each policy.
I've tried filtering using the WHERE clause in SQL (which I'm admittedly bad at), but haven't even come close to an answer - and don't even really know where to start.
My main goal is to try and get the rows filtered using order, policy number, and days on policy.
Any and all help is greatly appreciated!
If you want a solution based on Pandas, then define the following
filtering function:
def fltr(gr):
    # keep only rows for policies 11 and 13, in chronological (policy_order) order
    wrk = gr.query('policy_num in [11, 13]').sort_values(['policy_order'])
    pNum = wrk.set_index('policy_order').policy_num
    # reject the group if either policy is missing
    if not ((pNum == 11).any() and (pNum == 13).any()):
        return None
    ind11 = pNum[pNum == 11].index[0]
    ind13 = pNum[pNum == 13].index[0]
    # reject if the first 13 comes after the first 11 (wrong order)
    if ind13 > ind11:
        return None
    # require at least 350 days on each of the two policies
    if (wrk.groupby('policy_num').days_on_policy.sum() >= 350).all():
        return wrk.drop_duplicates(subset='policy_num')
    return None
Then use it in groupby:
df.groupby('cust_id').apply(fltr)
A short description of the filtering function
It starts with computing auxiliary variables:
wrk - rows of the current group for policy_num == either 11
or 13, ordered by policy_order.
pNum - policy_num column from wrk, indexed by policy_order.
The filtering function has two early returns of None that reject the current group:
pNum does not contain at least one 11 and at least one 13.
The index (actually policy_order) of the first 13 element in pNum
is greater than the index of the first 11 element (policy 13
follows policy 11).
The last decision is based on a question: Does each of the policies
in question (11 and 13) have the sum of days_on_policy >= 350?
If yes, the function returns the rows from wrk without repetitions,
dropping a possible later 13 stint (as in the case of group 5000).
Otherwise, the current group is also rejected.
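A hedged usage sketch, assuming df holds the question's table and fltr is the function above; depending on the pandas version, groupby().apply() may attach cust_id as an extra index level, which droplevel(0) removes:
import pandas as pd

result = df.groupby('cust_id').apply(fltr)
if isinstance(result.index, pd.MultiIndex):   # cust_id was added as an index level
    result = result.droplevel(0)
print(result)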
Here is what I guess you would need.
SELECT *
FROM policy p1
WHERE policy_num = 13
AND days_on_policy >= 350
AND EXISTS
(SELECT 1 FROM policy p2
WHERE p1.cust_id = p2.cust_id
AND p2.policy_num =11
AND p2.policy_start >= p1.policy_end
AND p2.days_on_policy >= 350)
UNION ALL
SELECT *
FROM policy p1
where policy_num = 11
AND days_on_policy >= 350
AND EXISTS
(SELECT 1 FROM policy p2
WHERE p1.cust_id = p2.cust_id
AND p2.policy_num =13
AND p1.policy_start >= p2.policy_end
AND p2.days_on_policy >= 350)
It is nearly always better to do the filtering of the data within the query, unless performance of the database is affected by the query.
If your dataset isn't too large, this is the procedure I would use to filter:
# filter on the criteria for the policy number and minimum days
df_13_fltr = df[(df['policy_num'] == 13) &
                (df['days_on_policy'] >= 350)][['row', 'cust_id', 'policy_end']]
df_11_fltr = df[(df['policy_num'] == 11) &
                (df['days_on_policy'] >= 350)][['row', 'cust_id', 'policy_start']]
# merge the 2 filtered DataFrames and compare the 13-policy end date with the
# 11-policy start date (left frame is the 11 filter, right frame is the 13 filter)
df_fltr = df_11_fltr.merge(df_13_fltr, on='cust_id', how='inner', suffixes=('11', '13'))
df_fltr = df_fltr[df_fltr['policy_end'] <= df_fltr['policy_start']][['row13', 'row11']]
# put the rows in a list
rows = list(df_fltr['row13'].values) + list(df_fltr['row11'])
# use the rows list in a lambda filter on the original dataset
df[df['row'].apply(lambda x: x in rows)]
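As a small aside (assuming df and rows from the snippet above), the final membership test can also be written with Series.isin, which is usually faster than an element-wise lambda:
df[df['row'].isin(rows)]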
With a self join and the conditions applied in the ON clause:
select t1.*
from tablename t1 inner join tablename t2
on
t2.cust_id = t1.cust_id
and (
(t2.policy_start = t1.policy_end) and (t1.policy_num = 13 and t2.policy_num = 11)
or
(t1.policy_start = t2.policy_end) and (t2.policy_num = 13 and t1.policy_num = 11)
)
and t1.days_on_policy >= 350 and t2.days_on_policy >= 350
order by t1.cust_id, t1.policy_start
Results:
> row | cust_id | policy_num | policy_start | policy_end | policy_order | days_on_policy
> --: | ------: | ---------: | :------------------ | :------------------ | -----------: | -------------:
> 5 | 5000 | 13 | 28/07/2014 00:00:00 | 24/08/2016 00:00:00 | 2 | 758
> 6 | 5000 | 11 | 24/08/2016 00:00:00 | 20/10/2018 00:00:00 | 3 | 787
> 11 | 9800 | 13 | 12/12/2001 00:00:00 | 06/10/2015 00:00:00 | 1 | 5046
> 12 | 9800 | 11 | 06/10/2015 00:00:00 | 06/05/2019 00:00:00 | 2 | 1308
I used a groupby on cust_id and a rolling window to look back over policy_num to find 11 as the current policy and 13 as the previous one. I originally thought to add a filter on 350 days, but commented it out because it could break the sequence of policy_num.
data = """
| row | cust_id | policy_num | policy_start | policy_end | policy_order | days_on_policy |
| 1 | 1000 | 17 | 09/23/2013 | 11/05/2013 | 1 | 43 |
| 2 | 1200 | 13 | 08/26/2011 | 04/30/2019 | 1 | 2804 |
| 3 | 3400 | 13 | 08/31/2012 | 02/22/2015 | 1 | 905 |
| 4 | 5000 | 17 | 04/12/2014 | 07/28/2014 | 1 | 107 |
| 5 | 5000 | 13 | 07/28/2014 | 08/24/2016 | 2 | 758 |
| 6 | 5000 | 11 | 08/24/2016 | 10/20/2018 | 3 | 787 |
| 7 | 5000 | 13 | 10/20/2018 | 05/02/2019 | 4 | 194 |
| 8 | 7600 | 13 | 02/02/2015 | 05/03/2019 | 1 | 1551 |
| 9 | 4300 | 11 | 01/07/2015 | 05/04/2017 | 1 | 848 |
| 10 | 4300 | 13 | 05/04/2017 | 05/05/2019 | 2 | 731 |
| 11 | 9800 | 13 | 12/12/2001 | 10/06/2015 | 1 | 5046 |
| 12 | 9800 | 11 | 10/06/2015 | 05/06/2019 | 2 | 1308 |
"""
data = data.strip().split('\n')
data = [i.strip().split('|') for i in data]
data = [i[1:-1] for i in data]
columns = [c.strip() for c in data[0]]
df = pd.DataFrame(data[1:], columns=columns)
print(df.columns)
df.set_index(['row'],inplace=True)
# set the datatypes for each column
df['cust_id'] = df['cust_id'].astype(int)
df['policy_num'] = df['policy_num'].astype(int)
df['policy_start'] = pd.to_datetime(df['policy_start'])
df['policy_end'] = pd.to_datetime(df['policy_end'])
df['policy_order'] = df['policy_order'].astype(int)
df['days_on_policy'] = df['days_on_policy'].astype(int)
#print(df)
def create_filter(df, filter_cols, filter_values, operator_values):
    # build one boolean Series per (column, value, operator) triple
    filter_list = []
    for col, val, operator in zip(filter_cols, filter_values, operator_values):
        if operator == '>':
            filter_list.append(df[col] > val)
        elif operator == '>=':
            filter_list.append(df[col] >= val)
        elif operator == '<':
            filter_list.append(df[col] < val)
        elif operator == '<=':
            filter_list.append(df[col] <= val)
        elif operator == '==':
            filter_list.append(df[col] == val)
    # AND all the conditions together row-wise
    return pd.concat(filter_list, axis=1).all(axis=1)
#filter_cols=['days_on_policy']
#filter_values=[350]
#operator_values=['>']
#filter=create_filter(df, filter_cols, filter_values,operator_values)
#df=df[filter]
df = df.sort_values(by=['cust_id','policy_order'], ascending=False)
#print(df)
df_grouped = df.groupby('cust_id')
rolling_df=df_grouped.rolling(window=1).sum()
prev_key, prev_policy_num, prev_days_on_policy = None, "", ""
for key, item in rolling_df.iterrows():
    policy_num = item['policy_num']
    days_on_policy = item['days_on_policy']
    if prev_key is not None:
        prev_policy_num, prev_days_on_policy = rolling_df.loc[prev_key][['policy_num', 'days_on_policy']]
        # same customer, current row is the earlier policy 13 and the previous row
        # (higher policy_order, because of the descending sort) is the later policy 11
        if (key[0] == prev_key[0] and policy_num == 13 and prev_policy_num == 11
                and prev_days_on_policy > 350 and days_on_policy > 350):
            print(prev_key[0], prev_policy_num)
    prev_key = key
output:
5000 11.0
9800 11.0

Need to aggregate count(rowid, colid) on dataframe in pandas

I've been trying to turn this
| row_id | col_id |
|--------|--------|
| 1 | 23 |
| 4 | 45 |
| ... | ... |
| 1 | 23 |
| ... | ... |
| 4 | 45 |
| ... | ... |
| 4 | 45 |
| ... | ... |
Into this
| row_id | col_id | count |
|--------|--------|---------|
| 1 | 23 | 2 |
| 4 | 45 | 3 |
| ... | ... | ... |
So all (row_i, col_j) occurrences are added up into the 'count' column. Note that row_id and col_id won't be unique in either case.
No success until now, at least if I want to keep it efficient. I can iterate over each pair and add up occurrences, but there has to be a simpler way in pandas (or numpy, for that matter).
Thanks!
EDIT 1:
As @j-bradley suggested, I tried the following:
# I use django-pandas
rdf = Record.objects.to_dataframe(['row_id', 'column_id'])
_ = rdf.groupby(['row_id', 'column_id'])['row_id'].count().head(20)
_.head(10)
And that outputs
row_id  column_id
1       108          1
        168          1
        218          1
        398          2
        422          1
10      35           2
        355          1
        489          1
100     352          1
        366          1
Name: row_id, dtype: int64
This seems ok. But it's a Series object and I'm not sure how to turn this into a dataframe with the required three columns. Pandas noob, as it seems. Any tips?
Thanks again.
You can group by columns A and B and call count on the groupby object:
df = pd.DataFrame({'A': [1, 4, 1, 4, 4], 'B': [23, 45, 23, 45, 45]})
df.groupby(['A','B'])['A'].count()
returns:
A B
1 23 2
4 45 3
Edited to make the answer more explicit
To turn the series back to a dataframe with a column named count:
_ = df.groupby(['A','B'])['A'].count()
the name of the series becomes the column name:
_.name = 'Count'
resetting the index, promotes the multi-index to columns and turns the series into a dataframe:
df = _.reset_index()
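An equivalent one-liner, sketched on the same toy data, is groupby.size, which lets you name the count column directly in reset_index:
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 1, 4, 4], 'B': [23, 45, 23, 45, 45]})
counts = df.groupby(['A', 'B']).size().reset_index(name='count')
print(counts)
#    A   B  count
# 0  1  23      2
# 1  4  45      3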
