I have these two DFs
Active:
Customer_ID | product_No| Rating
7 | 111 | 3.0
7 | 222 | 1.0
7 | 333 | 5.0
7 | 444 | 3.0
User:
Customer_ID | product_No| Rating
9 | 111 | 2.0
9 | 222 | 5.0
9 | 666 | 5.0
9 | 555 | 3.0
I want to find the ratings of the common products that both users rated (e.g. 111, 222) and remove any products not rated by both (e.g. 333, 444, 555, 666). So the new DFs should look like this:
Active:
Customer_ID | product_No| Rating
7 | 111 | 3.0
7 | 222 | 1.0
User:
Customer_ID | product_No| Rating
9 | 111 | 2.0
9 | 222 | 5.0
I do not know how to do this without for loops. Can you help me, please?
This is the code I have so far:
import pandas as pd
ratings = pd.read_csv("ratings.csv", names=['Customer_ID', 'product_No', 'Rating'])
active = ratings[ratings['Customer_ID'] == 7]
user = ratings[ratings['Customer_ID'] == 9]
You can first get the common product_No values using set intersection and then use the isin method to filter the original DataFrames:
common_product = set(active.product_No).intersection(user.product_No)
common_product
# {111, 222}
active[active.product_No.isin(common_product)]
#Customer_ID product_No Rating
#0 7 111 3.0
#1 7 222 1.0
user[user.product_No.isin(common_product)]
#Customer_ID product_No Rating
#0 9 111 2.0
#1 9 222 5.0
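To keep the results as the new DataFrames the question asks for, assign the filtered frames back, e.g.:
active = active[active.product_No.isin(common_product)]
user = user[user.product_No.isin(common_product)]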
Use query, referencing the other DataFrame with the @ prefix:
Active.query('product_No in @User.product_No')
Customer_ID product_No Rating
0 7 111 3.0
1 7 222 1.0
User.query('product_No in @Active.product_No')
Customer_ID product_No Rating
0 9 111 2.0
1 9 222 5.0
I tried this using INNER JOIN as follows:
import pandas as pd
df1 = pd.read_csv('a.csv')
df2 = pd.read_csv('b.csv')
print(df1)
print(df2)
df_ij = pd.merge(df1, df2, on='product_No', how='inner')
print(df_ij)
df_list = []
for df_e, suffx in zip([df1, df2], ['_x', '_y']):
    df_e = df_ij[['Customer_ID' + suffx, 'product_No', 'Rating' + suffx]]
    df_e.columns = list(df1)
    df_list.append(df_e)
print(df_list[0])
print(df_list[1])
It gives the following output:
# print(df1)
Customer_ID product_No Rating
0 7 111 3
1 7 222 1
2 7 333 5
3 7 444 3
# print(df2)
Customer_ID product_No Rating
0 9 111 2
1 9 222 5
2 9 777 5
3 9 555 3
# print the INNER JOINed df
Customer_ID_x product_No Rating_x Customer_ID_y Rating_y
0 7 111 3 9 2
1 7 222 1 9 5
# print the first df you want, with common 'product_No'
Customer_ID product_No Rating
0 7 111 3
1 7 222 1
# print the second df you want, with common 'product_No'
Customer_ID product_No Rating
0 9 111 2
1 9 222 5
The inner join keeps only the rows whose product_No appears in both DataFrames. Since there are common column names, for the columns not used in the join the merged DataFrame adds suffixes to distinguish them. You then simply extract the columns with the appropriate suffix to get your required final result.
There is a nice example of INNER JOIN here.
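If you prefer more readable column names after the join, pd.merge also accepts a suffixes argument; a minimal sketch building on the df1/df2 above (the '_active'/'_user' labels are illustrative names, not from the original code):
df_ij = pd.merge(df1, df2, on='product_No', how='inner', suffixes=('_active', '_user'))
# columns are now Customer_ID_active, Rating_active, Customer_ID_user, Rating_user
active_common = df_ij[['Customer_ID_active', 'product_No', 'Rating_active']]
active_common.columns = list(df1)
user_common = df_ij[['Customer_ID_user', 'product_No', 'Rating_user']]
user_common.columns = list(df1)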
Another answer for this question:
import pandas as pd
dict1={"Customer_id":[7,7,7,7],
"Product_No":[111,222,333,444],
"rating":[3.0,1.0,5.0,3.0]}
active=pd.DataFrame(dict1)
dict2={"Customer_id":[9,9,9,9],
"Product_No":[111,222,666,555],
"rating":[2.0,5.0,5.0,3.0]}
user=pd.DataFrame(dict2)
df3=pd.merge(active,user,on="Product_No",how="inner")
df3
active=df3[["Customer_id_x","Product_No","rating_x"]]
print(active)
user=df3[["Customer_id_y","Product_No","rating_y"]]
print(user)
I am trying to apply piecewise linear interpolation. I first tried to use pandas' built-in interpolate function, but it was not working.
Example data looks like this:
import pandas as pd
import numpy as np
d = {'ID':[5,5,5,5,5,5,5], 'month':[0,3,6,9,12,15,18], 'num':[7,np.nan,5,np.nan,np.nan,5,8]}
tempo = pd.DataFrame(data = d)
d2 = {'ID':[6,6,6,6,6,6,6], 'month':[0,3,6,9,12,15,18], 'num':[5,np.nan,2,np.nan,np.nan,np.nan,7]}
tempo2 = pd.DataFrame(data = d2)
this = []
this.append(tempo)
this.append(tempo2)
The actual data has over 1000 unique IDs, so I filtered each ID into its own dataframe and put them all into the list.
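Roughly, that splitting step looked like this (df_all here is a stand-in name for the combined data, not from the original code):
# df_all is a hypothetical frame holding the rows for every ID
this = [group.reset_index(drop=True) for _, group in df_all.groupby('ID')]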
The first dataframe in the list (tempo) is shown above.
I am trying to go through all the dataframes in the list and apply a piecewise linear interpolation to each. I tried changing month to an index and using .interpolate(method='index', inplace=True), but it was not working.
The expected output is
ID | month | num
5 | 0 | 7
5 | 3 | 6
5 | 6 | 5
5 | 9 | 5
5 | 12 | 5
5 | 15 | 5
5 | 18 | 8
This needs to be applied across all the dataframes in the list.
Assuming this is a follow-up to your previous question, change the code to:
for i, df in enumerate(this):
    this[i] = (df
               .set_index('month')
               # optional, because of the previous question
               .reindex(range(df['month'].min(), df['month'].max() + 3, 3))
               .interpolate()
               .reset_index()[df.columns]
               )
NB. I simplified the code by removing the groupby, which only works if each DataFrame holds a single group, as you mentioned in the other question; a groupby-based sketch for the multi-ID case follows the output below.
Output:
[ ID month num
0 5 0 7.0
1 5 3 6.0
2 5 6 5.0
3 5 9 5.0
4 5 12 5.0
5 5 15 5.0
6 5 18 8.0,
ID month num
0 6 0 5.00
1 6 3 3.50
2 6 6 2.00
3 6 9 3.25
4 6 12 4.50
5 6 15 5.75
6 6 18 7.00]
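If one DataFrame could hold several IDs, a groupby-based variant along these lines should also work (a sketch under that assumption, not part of the original answer; it assumes pandas is imported as pd as above):
def interpolate_id(g):
    # reindex this ID's rows onto a regular 3-month grid, then interpolate linearly
    months = range(g['month'].min(), g['month'].max() + 3, 3)
    return (g.set_index('month')
             .reindex(months)
             .interpolate()
             .reset_index()[g.columns])

this = [pd.concat([interpolate_id(g) for _, g in df.groupby('ID')], ignore_index=True)
        for df in this]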
I am using Python and pandas for data analysis. I have sparsely distributed data in different columns, like the following:
| id | col1a | col1b | col2a | col2b | col3a | col3b |
|----|-------|-------|-------|-------|-------|-------|
| 1 | 11 | 12 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | 21 | 86 | NaN | NaN |
| 3 | 22 | 87 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | 545 | 32 |
I want to combine this sparsely distributed data from the different columns into tightly packed columns, like the following:
| id | group | cola | colb |
|----|-------|-------|-------|
| 1 | g1 | 11 | 12 |
| 2 | g2 | 21 | 86 |
| 3 | g1 | 22 | 87 |
| 4 | g3 | 545 | 32 |
What I have tried is the following, but I am not able to do it properly:
df['cola']=np.nan
df['colb']=np.nan
df['cola'].fillna(df.col1a,inplace=True)
df['colb'].fillna(df.col1b,inplace=True)
df['cola'].fillna(df.col2a,inplace=True)
df['colb'].fillna(df.col2b,inplace=True)
df['cola'].fillna(df.col3a,inplace=True)
df['colb'].fillna(df.col3b,inplace=True)
But I think there must be a more concise and efficient way of doing this. How can I do it in a better way?
You can use df.stack(), assuming 'id' is your index (otherwise set 'id' as the index first), and then use pivot_table:
df = df.stack().reset_index(name='val', level=1)
df['group'] = 'g' + df['level_1'].str.extract(r'col(\d+)', expand=False)
df['level_1'] = df['level_1'].str.replace(r'\d+', '', regex=True)
df.pivot_table(index=['id', 'group'], columns='level_1', values='val')
level_1 cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0
Another alternative, with pd.wide_to_long:
m = pd.wide_to_long(df, ['col'], 'id', 'j', suffix=r'\d+\w+').reset_index()
(m.join(pd.DataFrame(m.pop('j').agg(list).tolist()))
  .assign(group=lambda x: x[0].radd('g'))
  .set_index(['id', 'group', 1])['col'].unstack().dropna()
  .rename_axis(None, axis=1).add_prefix('col').reset_index())
id group cola colb
0 1 g1 11 12
1 2 g2 21 86
2 3 g1 22 87
3 4 g3 545 32
Use:
import re
def fx(s):
    s = s.dropna()
    group = 'g' + re.search(r'\d+', s.index[0])[0]
    return pd.Series([group] + s.tolist(), index=['group', 'cola', 'colb'])
df1 = df.set_index('id').agg(fx, axis=1).reset_index()
# print(df1)
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g1 22.0 87.0
3 4 g3 545.0 32.0
This would be a way of doing it:
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'col1a': [11, np.nan, 22, np.nan],
                   'col1b': [12, np.nan, 87, np.nan],
                   'col2a': [np.nan, 21, np.nan, np.nan],
                   'col2b': [np.nan, 86, np.nan, np.nan],
                   'col3a': [np.nan, np.nan, np.nan, 545],
                   'col3b': [np.nan, np.nan, np.nan, 32]})
df_new = df.copy(deep=False)
df_new['group'] = 'g'+df_new['id'].astype(str)
df_new['cola'] = df_new[[x for x in df_new.columns if x.endswith('a')]].sum(axis=1)
df_new['colb'] = df_new[[x for x in df_new.columns if x.endswith('b')]].sum(axis=1)
df_new = df_new[['id','group','cola','colb']]
print(df_new)
Output:
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g3 22.0 87.0
3 4 g4 545.0 32.0
So if you have more suffixes (colc, cold, cole, colf, etc.) you can create a loop and then use:
suffixes = ['a', 'b', 'c', 'd', 'e', 'f']
cols = ['id', 'group'] + ['col' + x for x in suffixes]
for i in suffixes:
    df_new['col' + i] = df_new[[x for x in df_new.columns if x.endswith(i)]].sum(axis=1)
df_new = df_new[cols]
Thanks to @CeliusStingher for providing the code for the dataframe.
One suggestion is to set the id as the index and rearrange the columns, with the numbers extracted from the column names. Create a MultiIndex, reassign it to the columns, and stack to get the final result:
# set id as index
df = df.set_index("id")
# pull out the numbers from each column
# so that you have (cola, 1), (colb, 1), ...
# add g to the numbers ... (cola, g1), (colb, g1), ...
# create a MultiIndex
# and reassign it to the columns
df.columns = pd.MultiIndex.from_tuples([("".join((first, last)), f"g{second}")
                                         for first, second, last
                                         in df.columns.str.split(r"(\d)")],
                                        names=[None, "group"])
# stack the data to get your result
df.stack()
cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0
I have two dataframes. Say, for example, frame 1 is the student info:
student_id course
1 a
2 b
3 c
4 a
5 f
6 f
Frame 2 records each interaction the student has with a program:
student_id day number_of_clicks
1 4 60
1 5 34
1 7 87
2 3 33
2 4 29
2 8 213
2 9 46
3 2 103
I am trying to add the information from frame 2 to frame 1, i.e. for each student I would like to know the number of different days they accessed the database on, and the sum of all their clicks on those days, e.g.:
student_id course no_days total_clicks
1 a 3 181
2 b 4 321
3 c 1 103
4 a 0 0
5 f 0 0
6 f 0 0
I've tried to do this with groupby, but I couldn't add the information back into frame 1, or figure out how to sum the number of clicks. Any ideas?
First we aggregate your df2 to the desired information using GroupBy.agg. Then we merge that information into df1:
agg = df2.groupby('student_id').agg(
    no_days=('day', 'size'),
    total_clicks=('number_of_clicks', 'sum')
)
df1 = df1.merge(agg, on='student_id', how='left').fillna(0)
student_id course no_days total_clicks
0 1 a 3.0 181.0
1 2 b 4.0 321.0
2 3 c 1.0 103.0
3 4 a 0.0 0.0
4 5 f 0.0 0.0
5 6 f 0.0 0.0
Or, if you like one-liners, here's the same method as above in a single statement, in a more SQL-like style:
df1.merge(
    df2.groupby('student_id').agg(
        no_days=('day', 'size'),
        total_clicks=('number_of_clicks', 'sum')
    ),
    on='student_id',
    how='left'
).fillna(0)
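Note that ('day', 'size') counts rows; if a student could have several rows for the same day, counting distinct days may match "number of different days" more closely. A sketch of that variant, under that assumption:
agg = df2.groupby('student_id').agg(
    no_days=('day', 'nunique'),   # distinct days rather than row count
    total_clicks=('number_of_clicks', 'sum')
)
df1 = df1.merge(agg, on='student_id', how='left').fillna(0)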
Use merge, fillna for the null values, and then aggregate using groupby.agg:
import numpy as np

df = df1.merge(df2, how='left').fillna(0, downcast='infer')\
        .groupby(['student_id', 'course'], as_index=False)\
        .agg({'day': np.count_nonzero, 'number_of_clicks': np.sum})
print(df)
student_id course day number_of_clicks
0 1 a 3 181
1 2 b 4 321
2 3 c 1 103
3 4 a 0 0
4 5 f 0 0
5 6 f 0 0
I have the following dataframe:
| ID1 | ID2 | Box1_weight | Box2_weight | Average Prev Weight ID1 |
|-----|-----|-------------|-------------|-------------------------|
| 19  | 677 | 3           | 2           | -                       |
| 677 | 19  | 1           | 0           | 2                       |
| 19  | 677 | 3           | 1           | (0+3)/2 = 1.5           |
| 19  | 677 | 7           | 0           | (3+0+3)/3 = 2           |
| 677 | 19  | 1           | 3           | (0+1+1)/3 ≈ 0.67        |
I want to work out the moving average of the weight of the past 3 boxes, based on ID. I want to do this for all IDs in ID1.
I have put the column I want to calculate, along with the calculations, in the table above, labelled "Average Prev Weight ID1".
I can get a rolling average for each individual column using the following:
df_copy.groupby('ID1')['Box1_weight'].apply(lambda x: x.shift().rolling(period_length, min_periods=1).mean())
However, this does not take into account that the item may also have been packed in the column labelled "Box2_weight".
How can I get a rolling average that is per ID, across the two columns?
Any guidance is appreciated.
Here is my attempt.
Stack the 2 id and 2 weight columns to create a dataframe with 1 id and 1 weight column. Calculate the running average and assign the running average for ID1 back to the dataframe.
I have used your code for calculating the rolling average, but I arranged the data into df2 before doing it.
import pandas as pd
d = {
    "ID1": [19, 677, 19, 19, 677],
    "ID2": [677, 19, 677, 677, 19],
    "Box1_weight": [3, 1, 3, 7, 1],
    "Box2_weight": [2, 0, 1, 0, 3]
}
df = pd.DataFrame(d)
display(df)
period_length=3
ids = df[["ID1", "ID2"]].stack().values
weights = df[["Box1_weight", "Box2_weight"]].stack().values
df2=pd.DataFrame(dict(ids=ids, weights=weights))
rolling_avg = df2.groupby("ids")["weights"] \
    .apply(lambda x: x.shift().rolling(period_length, min_periods=1)
           .mean()).values.reshape(-1, 2)
df["rolling_avg"] = rolling_avg[:,0]
display(df)
Result
ID1 ID2 Box1_weight Box2_weight
0 19 677 3 2
1 677 19 1 0
2 19 677 3 1
3 19 677 7 0
4 677 19 1 3
ID1 ID2 Box1_weight Box2_weight rolling_avg
0 19 677 3 2 NaN
1 677 19 1 0 2.000000
2 19 677 3 1 1.500000
3 19 677 7 0 2.000000
4 677 19 1 3 0.666667
Not sure if this is what you want. I had trouble understanding your requirements. But here's a go:
import numpy as np

ids = ['ID1', 'ID2']
ind = np.argsort(df[ids].to_numpy(), axis=1)
make_sort = lambda s, ind: np.take_along_axis(s, ind, axis=1)
f = make_sort(df[ids].to_numpy(), ind)
s = make_sort(df[['Box1_weight', 'Box2_weight']].to_numpy(), ind)
df2 = pd.DataFrame(np.concatenate([f, s], axis=1), columns=df.columns)
res1 = df2.groupby('ID1').Box1_weight.rolling(3, min_periods=1).mean().shift()
res2 = df2.groupby('ID2').Box2_weight.rolling(3, min_periods=1).mean().shift()
means = pd.concat([res1, res2], axis=1).rename(columns={'Box1_weight': 'w1', 'Box2_weight': 'w2'})
x = df.set_index([df.ID1.values, df.index])
final = x[ids].merge(means, left_index=True, right_index=True)[['w1','w2']].sum(1).sort_index(level=1)
df['final_weight'] = final.tolist()
ID1 ID2 Box1_weight Box2_weight final_weight
0 19 677 3 2 0.000000
1 677 19 1 0 2.000000
2 19 677 3 1 1.500000
3 19 677 7 0 2.000000
4 677 19 1 3 0.666667
I have two pandas dataframes, df1 and df2, where I need to find df1['seq'] by doing a groupby on df2 and taking the sum of the column df2['sum_column']. Below are sample data and my current solution.
df1
id code amount seq
234 3 9.8 ?
213 3 18
241 3 6.4
543 3 2
524 2 1.8
142 2 14
987 2 11
658 3 17
df2
c_id name role sum_column
1 Aus leader 6
1 Aus client 1
1 Aus chair 7
2 Ned chair 8
2 Ned leader 3
3 Mar client 5
3 Mar chair 2
3 Mar leader 4
grouped = df2.groupby('c_id')['sum_column'].sum()
df3 = grouped.reset_index()
df3
c_id sum_column
1 14
2 11
3 11
The next step, where I am having issues, is to map df3 to df1 and conduct a conditional check to see if df1['amount'] is greater than df3['sum_column'].
df1['seq'] = np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')[sum_column]), 1, 0)
Printing out df1['code'].map(df3.set_index('c_id')['sum_column']), I get only NaN values.
Does anyone know what I am doing wrong here?
Expected results:
df1
id code amount seq
234 3 9.8 0
213 3 18 1
241 3 6.4 0
543 3 2 0
524 2 1.8 0
142 2 14 1
987 2 11 0
658 3 17 1
The solution can be simplified by removing .reset_index() for df3 and passing the Series to map:
s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)
An alternative is casting the boolean mask to integer, converting True, False to 1, 0:
df1['seq'] = (df1['amount'] > df1['code'].map(s)).astype(int)
print (df1)
id code amount seq
0 234 3 9.8 0
1 213 3 18.0 1
2 241 3 6.4 0
3 543 3 2.0 0
4 524 2 1.8 0
5 142 2 14.0 1
6 987 2 11.0 0
7 658 3 17.0 1
You forgot to add quotes around sum_column:
df1['seq']=np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')['sum_column']), 1, 0)