I have the following dataframe:
+-----+-----+-------------+-------------+-------------------------+
| ID1 | ID2 | Box1_weight | Box2_weight | Average Prev Weight ID1 |
+-----+-----+-------------+-------------+-------------------------+
| 19 | 677 | 3 | 2 | - |
+-----+-----+-------------+-------------+-------------------------+
| 677 | 19 | 1 | 0 | 2 |
+-----+-----+-------------+-------------+-------------------------+
| 19 | 677 | 3 | 1 | (0 + 3 )/2=1.5 |
+-----+-----+-------------+-------------+-------------------------+
| 19 | 677 | 7 | 0 | (3+0+3)/3=2 |
+-----+-----+-------------+-------------+-------------------------+
| 677 | 19 | 1 | 3 | (0+1+1)/3=0.67 |
+-----+-----+-------------+-------------+-------------------------+
I want to work out the moving average of the weight of the past 3 boxes, based on ID. I want to do this for all IDs in ID1.
I have put the column I want to calculate, along with the calculations, in the table above, labelled "Average Prev Weight ID1".
I can get a rolling average for each individual column using the following:
df_copy.groupby('ID1')['Box1_weight'].apply(lambda x: x.shift().rolling(period_length, min_periods=1).mean())
However, this does not take into account that the item may also have been packed in the column labelled "Box2_weight"
How can I get a rolling average that is per ID, across the two columns?
Any guidance is appreciated.
Here is my attempt
Stack the two ID columns and the two weight columns to create a dataframe with one ID column and one weight column. Calculate the running average there, then assign the running average for ID1 back to the original dataframe.
I have used your code for calculating the rolling average, but I rearranged the data into df2 before doing it.
import pandas as pd
d = {
"ID1": [19,677,19,19,677],
"ID2": [677, 19, 677,677, 19],
"Box1_weight": [3,1,3,7,1],
"Box2_weight": [2,0,1,0,3]
}
df = pd.DataFrame(d)
display(df)
period_length=3
ids = df[["ID1", "ID2"]].stack().values
weights = df[["Box1_weight", "Box2_weight"]].stack().values
df2=pd.DataFrame(dict(ids=ids, weights=weights))
# transform keeps the result aligned with df2's original row order
rolling_avg = df2.groupby("ids")["weights"] \
    .transform(lambda x: x.shift().rolling(period_length, min_periods=1)
               .mean()).values.reshape(-1,2)
df["rolling_avg"] = rolling_avg[:,0]
display(df)
Result
ID1 ID2 Box1_weight Box2_weight
0 19 677 3 2
1 677 19 1 0
2 19 677 3 1
3 19 677 7 0
4 677 19 1 3
ID1 ID2 Box1_weight Box2_weight rolling_avg
0 19 677 3 2 NaN
1 677 19 1 0 2.000000
2 19 677 3 1 1.500000
3 19 677 7 0 2.000000
4 677 19 1 3 0.666667
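The stacking idea above can also be wrapped in a small helper. This is only a sketch under the question's column names; `add_prev_weight_avg` is a made-up name, and it assumes the rows are already in chronological order. It uses `transform` so the per-ID result stays aligned with the stacked rows.

```python
import pandas as pd


def add_prev_weight_avg(df, period_length=3):
    """Hypothetical helper: rolling mean of the previous weights per ID."""
    # Long format, one row per (box, id), in the original row order:
    # ID1 of row 0, ID2 of row 0, ID1 of row 1, ID2 of row 1, ...
    long = pd.DataFrame({
        "ids": df[["ID1", "ID2"]].to_numpy().ravel(),
        "weights": df[["Box1_weight", "Box2_weight"]].to_numpy().ravel(),
    })
    # shift() excludes the current box, rolling() averages the previous ones
    avg = long.groupby("ids")["weights"].transform(
        lambda s: s.shift().rolling(period_length, min_periods=1).mean()
    )
    out = df.copy()
    # column 0 of the reshaped array belongs to ID1, column 1 to ID2
    out["rolling_avg"] = avg.to_numpy().reshape(-1, 2)[:, 0]
    return out


df = pd.DataFrame({
    "ID1": [19, 677, 19, 19, 677],
    "ID2": [677, 19, 677, 677, 19],
    "Box1_weight": [3, 1, 3, 7, 1],
    "Box2_weight": [2, 0, 1, 0, 3],
})
result = add_prev_weight_avg(df)
print(result)
```

The first row stays NaN because ID 19 has no earlier box at that point.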
Not sure if this is what you want. I had trouble understanding your requirements. But here's a go:
import numpy as np
ids = ['ID1', 'ID2']
ind = np.argsort(df[ids].to_numpy(), 1)
make_sort = lambda s, ind: np.take_along_axis(s, ind, axis=1)
f = make_sort(df[ids].to_numpy(), ind)
s = make_sort(df[['Box1_weight', 'Box2_weight']].to_numpy(), ind)
df2 = pd.DataFrame(np.concatenate([f,s], 1), columns=df.columns)
res1 = df2.groupby('ID1').Box1_weight.rolling(3, min_periods=1).mean().shift()
res2 = df2.groupby('ID2').Box2_weight.rolling(3, min_periods=1).mean().shift()
means = pd.concat([res1,res2], axis=1).rename(columns={'Box1_weight': 'w1', 'Box2_weight': 'w2'})
x = df.set_index([df.ID1.values, df.index])
final = x[ids].merge(means, left_index=True, right_index=True)[['w1','w2']].sum(1).sort_index(level=1)
df['final_weight'] = final.tolist()
ID1 ID2 Box1_weight Box2_weight final_weight
0 19 677 3 2 0.000000
1 677 19 1 0 2.000000
2 19 677 3 1 1.500000
3 19 677 7 0 2.000000
4 677 19 1 3 0.666667
Given the following df:
SequenceNumber | ID | CountNumber | Side | featureA | featureB
0 0 | 0 | 3 | Sell | 4 | 2
1 0 | 1 | 1 | Buy | 12 | 45
2 0 | 2 | 1 | Buy | 1 | 4
3 0 | 3 | 1 | Buy | 3 | 36
4 1 | 0 | 1 | Sell | 5 | 11
5 1 | 1 | 1 | Sell | 7 | 12
6 1 | 2 | 2 | Buy | 5 | 35
I want to create a new df such that for every SequenceNumber value, it takes the rows with the CountNumber == 1, and creates new rows where if the Side == 'Buy' then put their ID in a column named To. Otherwise put their ID in a column named From. Then the empty column out of From and To will take the ID of the row with the CountNumber > 1 (there is only one per each SequenceNumber value). The rest of the features should be preserved.
NOTE: basically each SequenceNumber represents one transaction that has either one seller and multiple buyers, or vice versa. I am trying to create a database that links the buyers and sellers, where From is the seller ID and To is the buyer ID.
The output should look like this:
SequenceNumber | From | To | featureA | featureB
0 0 | 0 | 1 | 12 | 45
1 0 | 0 | 2 | 1 | 4
2 0 | 0 | 3 | 3 | 36
3 1 | 0 | 2 | 5 | 11
4 1 | 1 | 2 | 7 | 12
I implemented a method that does this, but it uses for loops, which take a long time to run on large data. I am looking for a faster, more scalable method. Any suggestions?
Here is the original df:
df = pd.DataFrame({'SequenceNumber': [0, 0, 0, 0, 1, 1, 1],
'ID': [0, 1, 2, 3, 0, 1, 2],
'CountNumber': [3, 1, 1, 1, 1, 1, 2],
'Side': ['Sell', 'Buy', 'Buy', 'Buy', 'Sell', 'Sell', 'Buy'],
'featureA': [4, 12, 1, 3, 5, 7, 5],
'featureB': [2, 45, 4, 36, 11, 12, 35]})
You can reshape with a pivot, select the features to keep with a mask and rework the output with groupby.first then concat:
features = list(df.filter(like='feature'))
out = (
# repeat the rows with CountNumber > 1
df.loc[df.index.repeat(df['CountNumber'])]
# rename Sell/Buy into from/to and de-duplicate the rows per group
.assign(Side=lambda d: d['Side'].map({'Sell': 'from', 'Buy': 'to'}),
n=lambda d: d.groupby(['SequenceNumber', 'Side']).cumcount()
)
# mask the features where CountNumber > 1
.assign(**{f: lambda d, f=f: d[f].mask(d['CountNumber'].gt(1)) for f in features})
.drop(columns='CountNumber')
# reshape with a pivot
.pivot(index=['SequenceNumber', 'n'], columns='Side')
)
out = (
pd.concat([out['ID'], out.drop(columns='ID').groupby(level=0, axis=1).first()], axis=1)
.reset_index('SequenceNumber')
)
Output:
SequenceNumber from to featureA featureB
n
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
0 1 0 2 5.0 11.0
1 1 1 2 7.0 12.0
Alternative using a merge, as suggested by ifly6:
features = list(df.filter(like='feature'))
df1 = df.query('Side=="Sell"').copy()
df1[features] = df1[features].mask(df1['CountNumber'].gt(1))
df2 = df.query('Side=="Buy"').copy()
df2[features] = df2[features].mask(df2['CountNumber'].gt(1))
out = (df1.merge(df2, on='SequenceNumber').rename(columns={'ID_x': 'from', 'ID_y': 'to'})
.set_index(['SequenceNumber', 'from', 'to'])
.filter(like='feature')
.pipe(lambda d: d.groupby(d.columns.str.replace('_.*?$', '', regex=True), axis=1).first())
.reset_index()
)
Output:
SequenceNumber from to featureA featureB
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
3 1 0 2 5.0 11.0
4 1 1 2 7.0 12.0
Initial response, which gets the answer half complete: split the data into sellers and buyers, then merge it against itself on the sequence number:
ndf = df.query('Side == "Sell"').merge(
df.query('Side == "Buy"'), on='SequenceNumber', suffixes=['_sell', '_buy']) \
.rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
I then drop the side variable.
ndf = ndf.drop(columns=[i for i in ndf.columns if i.startswith('Side')])
This creates a very wide table:
SequenceNumber From CountNumber_sell featureA_sell featureB_sell To CountNumber_buy featureA_buy featureB_buy
0 0 0 3 4 2 1 1 12 45
1 0 0 3 4 2 2 1 1 4
2 0 0 3 4 2 3 1 3 36
3 1 0 1 5 11 2 2 5 35
4 1 1 1 7 12 2 2 5 35
This leaves you, however, with two featureA and featureB columns. I don't think your question clearly establishes which one takes precedence. Please provide more information on that.
Is it select the side with the lower CountNumber? Is it when CountNumber == 1? If the latter, then just null out the entries at the merge stage, do the merge, and then forward fill your appropriate columns to recover the proper values.
Re nulling. If you null the portions of featureA and featureB where the CountNumber is not 1, you can then create new versions of those columns after the merge by forward filling and selecting.
import numpy as np
s = df.query('Side == "Sell"').copy()
s.loc[s['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan
b = df.query('Side == "Buy"').copy()
b.loc[b['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan
ndf = s.merge(
b, on='SequenceNumber', suffixes=['_sell', '_buy']) \
.rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
ndf['featureA'] = ndf[['featureA_buy', 'featureA_sell']] \
.ffill(axis=1).iloc[:, -1]
ndf['featureB'] = ndf[['featureB_buy', 'featureB_sell']] \
.ffill(axis=1).iloc[:, -1]
ndf = ndf.drop(
columns=[i for i in ndf.columns if i.startswith('Side')
or i.endswith('_sell') or i.endswith('_buy')])
The final version of ndf then is:
SequenceNumber From To featureA featureB
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
3 1 0 2 5.0 11.0
4 1 1 2 7.0 12.0
Here is an alternative approach
df1 = df.loc[df['CountNumber'] == 1].copy()
df1['From'] = (df1['ID'].where(df1['Side'] == 'Sell', df1['SequenceNumber']
.map(df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
)
df1['To'] = (df1['ID'].where(df1['Side'] == 'Buy', df1['SequenceNumber']
.map(df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
)
df1 = df1.drop(['ID', 'CountNumber', 'Side'], axis=1)
df1 = df1[['SequenceNumber', 'From', 'To', 'featureA', 'featureB']]
df1.reset_index(drop=True, inplace=True)
print(df1)
SequenceNumber From To featureA featureB
0 0 0 1 12 45
1 0 0 2 1 4
2 0 0 3 3 36
3 1 0 2 5 11
4 1 1 2 7 12
Problem:
I have a DataFrame like so:
import pandas as pd
df = pd.DataFrame({
"name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
"category":["a","b","c","b","a","b","c","c","a","c"],
"amount":[100,200,13,23,40,2,43,92,83,1]
})
name | category | amount
----------------------------
0 john | a | 100
1 jim | b | 200
2 eric | c | 13
3 jim | b | 23
4 john | a | 40
5 jim | b | 2
6 jim | c | 43
7 eric | c | 92
8 eric | a | 83
9 john | c | 1
I would like to add two new columns: first; the total amount for the relevant category for the name of the row (eg: the value in row 0 would be 140, because john has a total of 100 + 40 of the a category). Second; the counts of those name and category combinations which are being summed in the first new column (eg: the row 0 value would be 2).
Desired output:
The output I'm looking for here looks like this:
name | category | amount | sum_for_category | count_for_category
------------------------------------------------------------------------
0 john | a | 100 | 140 | 2
1 jim | b | 200 | 225 | 3
2 eric | c | 13 | 105 | 2
3 jim | b | 23 | 225 | 3
4 john | a | 40 | 140 | 2
5 jim | b | 2 | 225 | 3
6 jim | c | 43 | 43 | 1
7 eric | c | 92 | 105 | 2
8 eric | a | 83 | 83 | 1
9 john | c | 1 | 1 | 1
I don't want to group the data by the features because I want to keep the same number of rows. I just want to tag on the desired value for each row.
Best I could do:
I can't find a good way to do this. The best I've been able to come up with is the following:
names = df["name"].unique()
categories = df["category"].unique()
sum_for_category = {i:{
j:df.loc[(df["name"]==i)&(df["category"]==j)]["amount"].sum() for j in categories
} for i in names}
df["sum_for_category"] = df.apply(lambda x: sum_for_category[x["name"]][x["category"]],axis=1)
count_for_category = {i:{
j:df.loc[(df["name"]==i)&(df["category"]==j)]["amount"].count() for j in categories
} for i in names}
df["count_for_category"] = df.apply(lambda x: count_for_category[x["name"]][x["category"]],axis=1)
But this is extremely clunky and slow; far too slow to be viable on my actual dataset (roughly 700,000 rows x 10 columns). I'm sure there's a better and faster way to do this... Many thanks in advance.
You need two groupby.transform:
g = df.groupby(['name', 'category'])['amount']
df['sum_for_category'] = g.transform('sum')
df['count_for_category'] = g.transform('size')
output:
name category amount sum_for_category count_for_category
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
Another possible solution:
g = df.groupby(['name', 'category']).amount.agg(['sum','count']).reset_index()
df.merge(g, on = ['name', 'category'], how = 'left')
Output:
name category amount sum count
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
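The merge above leaves the new columns named sum and count. As a sketch only, named aggregation can produce the column names asked for in the question directly; `stats` and `out` are just illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "jim", "eric", "jim", "john", "jim", "jim", "eric", "eric", "john"],
    "category": ["a", "b", "c", "b", "a", "b", "c", "c", "a", "c"],
    "amount": [100, 200, 13, 23, 40, 2, 43, 92, 83, 1],
})

# One row per (name, category) pair, with the target column names
stats = df.groupby(["name", "category"], as_index=False).agg(
    sum_for_category=("amount", "sum"),
    count_for_category=("amount", "count"),
)

# A left merge keeps every original row, in the original order
out = df.merge(stats, on=["name", "category"], how="left")
print(out)
```

Note that "count" counts non-null amounts, so it matches the group size only when amount has no missing values.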
import pandas as pd
df = pd.DataFrame({
"name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
"category":["a","b","c","b","a","b","c","c","a","c"],
"amount":[100,200,13,23,40,2,43,92,83,1]
})
df_Count = df.groupby(['name','category']).count().reset_index().rename({'amount':'Count_For_Category'}, axis=1)
df_Sum = df.groupby(['name','category']).sum().reset_index().rename({'amount':'Sum_For_Category'},axis=1)
df_v2 = pd.merge(df,df_Count[['name','category','Count_For_Category']], left_on=['name','category'], right_on=['name','category'], how='left')
df_v2 = pd.merge(df_v2,df_Sum[['name','category','Sum_For_Category']], left_on=['name','category'], right_on=['name','category'], how='left')
df_v2
The code above is kept simple so that it is easy to follow; run it and df_v2 will contain the two new columns.
I have a concept of what I need to do, but I can't write the right code to run, please take a look and give some advice.
step 1. find the rows that contains values in the second column
step 2. with those rows, compare the value in the first column with their previous row
step 3. drop the rows with larger first column value
|missing | diff |
|--------|------|
| 0 | nan |
| 1 | 60 |
| 1 | nan |
| 0 | nan |
| 0 | nan |
| 1 | 180 |
| 1 | nan |
| 0 | 120 |
e.g. I want to compare the missing values of the rows that have a value in diff [120, 180, 60] with those of their previous rows. In the end, the desired dataframe will look like
|missing | diff |
|--------|------|
| 0 | nan |
| 1 | nan |
| 0 | nan |
| 0 | nan |
| 0 | 120 |
Update to the question, following the answer: I got the same df as the original df.
import pandas as pd
import numpy as np
data={'missing':[0,1,1,0,0,1,1,0],'diff':[np.nan,60,np.nan,np.nan,np.nan,180,np.nan,120]}
df=pd.DataFrame(data)
df
missing diff
0 0 NaN
1 1 60.0
2 1 NaN
3 0 NaN
4 0 NaN
5 1 180.0
6 1 NaN
7 0 120.0
for ind in df.index:
    if df['diff'][ind]!=np.nan:
        if ind!=0:
            if df['missing'][ind]>df['missing'][ind-1]:
                df=df.drop(ind,0)
            else:
                df=df.drop(ind-1,0)
df
missing diff
0 0 NaN
1 1 60.0
2 1 NaN
3 0 NaN
4 0 NaN
5 1 180.0
6 1 NaN
7 0 120.0
IIUC, you can try:
m = df['diff'].notna()
df = (
pd.concat([
df[df['diff'].isna()],
df[m][df[m.shift(-1).fillna(False)]['missing'].values >
df[m]['missing'].values]
])
)
OUTPUT:
missing diff
1 0 <NA>
3 1 <NA>
4 0 <NA>
5 0 <NA>
7 1 <NA>
8 0 120
This should work (it needs import numpy as np):
for ind in df.index:
    # only look at rows whose diff value is a number
    if not np.isnan(df['diff'][ind]):
        if ind != 0:
            if df['missing'][ind] > df['missing'][ind-1]:
                df = df.drop(ind)
            else:
                df = df.drop(ind-1)
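This will also work, using pd.notna for the NaN check (comparing against the string "nan" is always true for a float):
for ind in df.index:
    if pd.notna(df['diff'][ind]):
        if ind != 0:
            if df['missing'][ind] > df['missing'][ind-1]:
                df = df.drop(ind)
            else:
                df = df.drop(ind-1)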
import numpy as np
import pandas as pd
# define dictionary
data = {'missing': [0,1,1,0,0,1,1,0], 'diff': [np.nan,60,np.nan,np.nan,np.nan,180,np.nan,120]}
# dictionary to dataframe
df = pd.DataFrame(data)
print(df)
# for each row in dataframe
for ind in df.index:
    # only rows whose diff value is a number
    if pd.notna(df['diff'][ind]):
        # the first row has no previous row to compare with
        if ind != 0:
            # compare the first column with the previous row
            if df['missing'][ind] > df['missing'][ind-1]:
                # drop whichever row has the larger first column value
                df = df.drop(ind)
            else:
                df = df.drop(ind-1)
print(df)
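The three steps from the question can also be sketched without a Python loop. This is an illustration rather than a drop-in replacement: it assumes each comparison is made against the original previous row, and it reuses the sample data from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "missing": [0, 1, 1, 0, 0, 1, 1, 0],
    "diff": [np.nan, 60, np.nan, np.nan, np.nan, 180, np.nan, 120],
})

# step 1: rows that have a value in diff (the first row has no previous row)
has_diff = df["diff"].notna() & (np.arange(len(df)) > 0)

# step 2: compare missing with the previous row
larger_than_prev = df["missing"] > df["missing"].shift()

# step 3: drop the current row if it is larger, otherwise drop the previous row
drop_current = has_diff & larger_than_prev
drop_previous = (has_diff & ~larger_than_prev).shift(-1, fill_value=False)

result = df[~(drop_current | drop_previous)]
print(result)
```

On the sample data this keeps rows 0, 2, 3, 4 and 7, the same rows the loop keeps.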
Sorry in advance for the number of images, but they help demonstrate the issue.
I have built a dataframe which contains film thickness measurements, for a number of substrates, for a number of layers, as function of coordinates:
| | Sub | Result | Layer | Row | Col |
|----|-----|--------|-------|-----|-----|
| 0 | 1 | 2.95 | 3 - H | 0 | 72 |
| 1 | 1 | 2.97 | 3 - V | 0 | 72 |
| 2 | 1 | 0.96 | 1 - H | 0 | 72 |
| 3 | 1 | 3.03 | 3 - H | -42 | 48 |
| 4 | 1 | 3.04 | 3 - V | -42 | 48 |
| 5 | 1 | 1.06 | 1 - H | -42 | 48 |
| 6 | 1 | 3.06 | 3 - H | 42 | 48 |
| 7 | 1 | 3.09 | 3 - V | 42 | 48 |
| 8 | 1 | 1.38 | 1 - H | 42 | 48 |
| 9 | 1 | 3.05 | 3 - H | -21 | 24 |
| 10 | 1 | 3.08 | 3 - V | -21 | 24 |
| 11 | 1 | 1.07 | 1 - H | -21 | 24 |
| 12 | 1 | 3.06 | 3 - H | 21 | 24 |
| 13 | 1 | 3.09 | 3 - V | 21 | 24 |
| 14 | 1 | 1.05 | 1 - H | 21 | 24 |
| 15 | 1 | 3.01 | 3 - H | -63 | 0 |
| 16 | 1 | 3.02 | 3 - V | -63 | 0 |
and this continues for >10 subs (per batch), and 13 sites per sub, and for 3 layers - this df is a composite.
I am attempting to present the data as a facetgrid of heatmaps (adapting code from How to make heatmap square in Seaborn FacetGrid - thanks!)
I can plot a subset of the df quite happily:
spam = df.loc[df.Sub== 6].loc[df.Layer == '3 - H']
spam_p= spam.pivot(index='Row', columns='Col', values='Result')
sns.heatmap(spam_p, cmap="plasma")
BUT - there are some missing results, where the layer measurement errors (returns '10000') so I've replaced these with NaNs:
df['Result'] = df['Result'].replace(10000, np.nan)
To plot a facetgrid to show all subs/layers, I've written the following code:
def draw_heatmap(*args, **kwargs):
    data = kwargs.pop('data')
    d = data.pivot(columns=args[0], index=args[1],
                   values=args[2])
    sns.heatmap(d, **kwargs)
fig = sns.FacetGrid(spam, row='Sub',
                    col='Layer', height=5, aspect=1)
fig.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False, cmap="plasma", annot=True, annot_kws={"size": 20})
which yields:
It has automatically adjusted axes to not show any positions where there is a NaN.
I have tried masking (see https://github.com/mwaskom/seaborn/issues/375) but just errors out with Inconsistent shape between the condition and the input (got (237, 15) and (7, 7)).
And the result of this is, when not using the cropped down dataset (i.e. df instead of spam, the code generates the following Facetgrid):
Plots featuring missing values at extreme (edge) coordinate positions make the plot shift within the axes - here all apparently to the upper left. Sub #5, layer 3-H should look like:
i.e. blanks in the places where there are NaNs.
Why is the facetgrid shifting the entire plot up and/or left? The alternative is dynamically generating subplots based on a sub/layer-count (ugh!).
Any help very gratefully received.
Full dataset for 2 layers of sub 5:
Sub Result Layer Row Col
0 5 2.987 3 - H 0 72
1 5 0.001 1 - H 0 72
2 5 1.184 3 - H -42 48
3 5 1.023 1 - H -42 48
4 5 3.045 3 - H 42 48
5 5 0.282 1 - H 42 48
6 5 3.083 3 - H -21 24
7 5 0.34 1 - H -21 24
8 5 3.07 3 - H 21 24
9 5 0.41 1 - H 21 24
10 5 NaN 3 - H -63 0
11 5 NaN 1 - H -63 0
12 5 3.086 3 - H 0 0
13 5 0.309 1 - H 0 0
14 5 0.179 3 - H 63 0
15 5 0.455 1 - H 63 0
16 5 3.067 3 - H -21 -24
17 5 0.136 1 - H -21 -24
18 5 1.907 3 - H 21 -24
19 5 1.018 1 - H 21 -24
20 5 NaN 3 - H -42 -48
21 5 NaN 1 - H -42 -48
22 5 NaN 3 - H 42 -48
23 5 NaN 1 - H 42 -48
24 5 NaN 3 - H 0 -72
25 5 NaN 1 - H 0 -72
You can create a list of the unique column and row labels and reindex the pivot table with them, so that every facet uses the same full grid.
cols = df["Col"].unique()
rows = df["Row"].unique()
pivot = data.pivot(...).reindex(index=rows, columns=cols)
as seen in this answer.
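To see what the reindexing does without any plotting, here is a minimal pandas-only sketch. The toy data is invented for illustration, and it assumes the rows holding NaN never reach the pivot (simulated here with dropna), which is what makes the labels disappear.

```python
import numpy as np
import pandas as pd

# Toy cross-shaped layout; the measurement at Row=-1 failed
df = pd.DataFrame({
    "Row": [0, 0, 0, 1, -1],
    "Col": [-1, 0, 1, 0, 0],
    "Result": [1.0, 2.0, 3.0, 4.0, np.nan],
})
rows = np.unique(df["Row"])
cols = np.unique(df["Col"])

# Without the NaN row, the pivot has no label for Row=-1 at all
cropped = df.dropna(subset=["Result"]).pivot(index="Row", columns="Col", values="Result")

# Reindexing restores the full grid; the missing position becomes a NaN cell
full = cropped.reindex(index=rows, columns=cols)

print(cropped.shape, full.shape)
```

A heatmap drawn from `full` keeps the blank cell in place instead of shifting the remaining cells.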
Some complete code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
r = np.repeat([0,-2,2,-1,1,-3],2)
row = np.concatenate((r, [0]*2, -r[::-1]))
c = np.array([72]*2+[48]*4 + [24]*4 + [0]* 3)
col = np.concatenate((c,-c[::-1]))
df = pd.DataFrame({"Result" : np.random.rand(26),
"Layer" : list("AB")*13,
"Row" : row, "Col" : col})
df1 = df.copy()
df1["Sub"] = [5]*len(df1)
df1.loc[10:11, "Result"] = np.nan
df1.loc[20:, "Result"] = np.nan
df2 = df.copy()
df2["Sub"] = [3]*len(df2)
df2.loc[0:2, "Result"] = np.nan
df = pd.concat([df1,df2])
cols = np.unique(df["Col"].values)
rows = np.unique(df["Row"].values)
def draw_heatmap(*args, **kwargs):
    data = kwargs.pop('data')
    d = data.pivot(columns=args[0], index=args[1],
                   values=args[2])
    # reindex to the full grid so missing positions stay as blank cells
    d = d.reindex(index=rows, columns=cols)
    print(d)
    sns.heatmap(d, **kwargs)
grid = sns.FacetGrid(df, row='Sub', col='Layer', height=3.5, aspect=1 )
grid.map_dataframe(draw_heatmap, 'Col', 'Row', 'Result', cbar=False,
cmap="plasma", annot=True)
plt.show()
I have these two DFs
Active:
Customer_ID | product_No| Rating
7 | 111 | 3.0
7 | 222 | 1.0
7 | 333 | 5.0
7 | 444 | 3.0
User:
Customer_ID | product_No| Rating
9 | 111 | 2.0
9 | 222 | 5.0
9 | 666 | 5.0
9 | 555 | 3.0
I want to find the ratings of the common products that both users rated (e.g. 111,222) and remove any uncommon products (e.g. 444,333,555,666). So the new DFs should be like this:
Active:
Customer_ID | product_No| Rating
7 | 111 | 3.0
7 | 222 | 1.0
User:
Customer_ID | product_No| Rating
9 | 111 | 2.0
9 | 222 | 5.0
I do not know how to do this without for loops. Can you help me, please
This is the code I have so far:
import pandas as pd
ratings = pd.read_csv("ratings.csv", names=['Customer_ID','product_No','Rating'])
active = ratings[ratings['Customer_ID']==7]
user = ratings[ratings['Customer_ID']==9]
You can firstly get the common product_No using set intersection and then use isin method to filter on the original data frames:
common_product = set(active.product_No).intersection(user.product_No)
common_product
# {111, 222}
active[active.product_No.isin(common_product)]
#Customer_ID product_No Rating
#0 7 111 3.0
#1 7 222 1.0
user[user.product_No.isin(common_product)]
#Customer_ID product_No Rating
#0 9 111 2.0
#1 9 222 5.0
Use query referencing the other dataframes
Active.query('product_No in @User.product_No')
Customer_ID product_No Rating
0 7 111 3.0
1 7 222 1.0
User.query('product_No in @Active.product_No')
Customer_ID product_No Rating
0 9 111 2.0
1 9 222 5.0
I tried this using INNER JOIN as follows:
import pandas as pd
df1 = pd.read_csv('a.csv')
df2 = pd.read_csv('b.csv')
print(df1)
print(df2)
df_ij = pd.merge(df1, df2, on='product_No', how='inner')
print(df_ij)
df_list = []
for df_e, suffx in zip([df1, df2], ['_x', '_y']):
    # pick this frame's columns out of the joined frame by suffix
    df_e = df_ij[['Customer_ID'+suffx, 'product_No', 'Rating'+suffx]]
    df_e.columns = list(df1)
    df_list.append(df_e)
print(df_list[0])
print(df_list[1])
It gives the following output:
# print df1
Customer_ID product_No Rating
0 7 111 3
1 7 222 1
2 7 333 5
3 7 444 3
# print df2
Customer_ID product_No Rating
0 9 111 2
1 9 222 5
2 9 777 5
3 9 555 3
# print the INNER JOINed df
Customer_ID_x product_No Rating_x Customer_ID_y Rating_y
0 7 111 3 9 2
1 7 222 1 9 5
# print the first df you want, with common 'product_No'
Customer_ID product_No Rating
0 7 111 3
1 7 222 1
# print the second df you want, with common 'product_No'
Customer_ID product_No Rating
0 9 111 2
1 9 222 5
The inner join selects the common rows in each df. Since there are common column names, for columns not used in the join, the joined df has added suffixes to distinguish between those column names. You then need to simply extract columns to get your required final result, by just specifying the appropriate suffix.
There is a nice example of INNER JOIN here.
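If the suffixed columns are not wanted, each frame can instead be inner-joined against only the other frame's product_No column. This is a sketch of that variation using the data from the question; `active_common` and `user_common` are illustrative names.

```python
import pandas as pd

active = pd.DataFrame({
    "Customer_ID": [7, 7, 7, 7],
    "product_No": [111, 222, 333, 444],
    "Rating": [3.0, 1.0, 5.0, 3.0],
})
user = pd.DataFrame({
    "Customer_ID": [9, 9, 9, 9],
    "product_No": [111, 222, 666, 555],
    "Rating": [2.0, 5.0, 5.0, 3.0],
})

# Joining against the key column only means no _x/_y suffixes appear;
# drop_duplicates guards against repeated products multiplying rows
active_common = active.merge(user[["product_No"]].drop_duplicates(), on="product_No", how="inner")
user_common = user.merge(active[["product_No"]].drop_duplicates(), on="product_No", how="inner")

print(active_common)
print(user_common)
```

Both results keep the original three column names, so no renaming step is needed afterwards.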
Here is one way to answer this question, using an inner merge on Product_No:
import pandas as pd
dict1={"Customer_id":[7,7,7,7],
"Product_No":[111,222,333,444],
"rating":[3.0,1.0,5.0,3.0]}
active=pd.DataFrame(dict1)
dict2={"Customer_id":[9,9,9,9],
"Product_No":[111,222,666,555],
"rating":[2.0,5.0,5.0,3.0]}
user=pd.DataFrame(dict2)
df3=pd.merge(active,user,on="Product_No",how="inner")
df3
active=df3[["Customer_id_x","Product_No","rating_x"]]
print(active)
user=df3[["Customer_id_y","Product_No","rating_y"]]
print(user)