Say I have two DataFrames that look like the following:
df1:
movieID  1  2  3  4
userID
0        2  0  0  2
1        1  1  4  0
2        0  2  3  0
3        1  2  0  0
and
df2:
   userID  movieID
0       0        2
1       0        3
2       0        4
3       1        3
What I am trying to accomplish is joining the two so that df2 contains a new column with the associated rating of a user for a specific movie. Thus df2 in this example would become:
df2:
   userID  movieID  rating
0       0        2       0
1       0        3       0
2       0        4       2
3       1        3       4
I don't believe that simply reformatting df2 to have the same shape as df1 would work, because there is no guarantee that it will have all userIDs or movieIDs. I've looked into the merge function, but I'm confused about how to set the how and on parameters in this scenario. If anyone can explain how I could achieve this, it would be greatly appreciated.
You can use apply() row-wise to look up df1.loc[row.userID, row.movieID].
Just make sure the dtype of df1.columns matches df2.movieID, and the dtype of df1.index matches df2.userID.
df1.columns = df1.columns.astype(df2.movieID.dtype)
df1.index = df1.index.astype(df2.userID.dtype)
df2['rating'] = df2.apply(lambda row: df1.loc[row.userID, row.movieID], axis=1)
# userID movieID rating
# 0 0 2 0
# 1 0 3 0
# 2 0 4 2
# 3 1 3 4
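If you would rather use merge, as mentioned in the question, one option (a sketch, assuming df1 is the wide ratings matrix with userID as its index and movieID as its columns) is to reshape df1 into long form first and then do a left merge, which keeps every row of df2 even when a pair is missing from df1. The same dtype caveat applies.
# Reshape the wide ratings matrix into long form: one row per (userID, movieID) pair
long_ratings = (df1.stack()
                   .rename('rating')
                   .rename_axis(['userID', 'movieID'])
                   .reset_index())
# Left merge keeps every row of df2, filling NaN where no rating exists
df2 = df2.merge(long_ratings, on=['userID', 'movieID'], how='left')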
I have a dataframe that looks like this :
ID A B C
1 1 0 0
1 0 1 0
2 1 0 0
I want the output to be like this :
ID A B C
1 1 1 0
2 1 0 0
Kindly guide how to achieve this.
Use groupby.max with as_index=False to get 1 whenever there is at least one 1:
df.groupby('ID', as_index=False).max()
Output:
ID A B C
0 1 1 1 0
1 2 1 0 0
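If the values are strictly 0/1 indicators (an assumption based on the sample data, not stated in the question), groupby with any() expresses the same intent and then casts back to integers:
# True whenever any row in the group contains a 1, then cast back to 0/1
df.groupby('ID').any().astype(int).reset_index()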
I am working in a notebook, so if I run:
df1 = df1.join(series2)
It works fine. However, if I run it again, I receive the following error:
ValueError: columns overlap but no suffix specified
Because it is equivalent to df1 = df1.join(series2).join(series2). Is there any way I can force an overwrite on the overlapping columns without creating an endless amount of columns with the _y suffix?
Sample df1
index a
0 0
0 1
1 2
1 3
2 4
2 5
Sample series2
index b
0 1
1 2
2 3
Desired output from df1 = df1.join(series2)
index a b
0 0 1
0 1 1
1 2 2
1 3 2
2 4 3
2 5 3
Desired output from df1 = df1.join(series2); df1 = df1.join(series2)
# same as above because of forced overwrite on either the left or right join.
index a b
0 0 1
0 1 1
1 2 2
1 3 2
2 4 3
2 5 3
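One way to make the cell safe to re-run is to drop the overlapping column before joining, so the join is idempotent. A minimal sketch, assuming series2 is a named Series ('b') aligned on the same index:
import pandas as pd

df1 = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5]}, index=[0, 0, 1, 1, 2, 2])
series2 = pd.Series([1, 2, 3], index=[0, 1, 2], name='b')

# Drop the column if it already exists, then join; re-running the cell
# no longer raises "columns overlap but no suffix specified"
df1 = df1.drop(columns=series2.name, errors='ignore').join(series2)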
I have a matrix df with 70 columns.
id day_1 day_2 day_3 day_4 ... day_69 day_70
1 1 2 4 1 1 1
2 0 0 0 0 0 0
3 0 3 0 0 0 0
4 3 2 1 0 0 3
I would like to aggregate the columns dynamically by a variable number of days such as 2, 7, or 10, i.e. bi-daily, weekly, ten-daily, etc.
E.g. one of the results for aggregation (sum) by 2 days would be a dataframe with 35 columns, see below:
id bi_daily_1 bi_daily_2 ... bi_daily_35
1 3 5 2
2 0 0 0
3 3 0 0
4 5 1 3
where :
bi_daily_1 = aggregation(day_1, day_2)
bi_daily_2 = aggregation(day_3, day_4) and so on...
Note: the real matrix shape is approx. (2000, 1500).
Use floor division based on the number of days to determine groups (df.shape[1] is the number of columns in the dataframe), then use groupby on these groups specifying the axis as 1 (columns). Then just rename the columns.
days = 2
# assumes `id` is the index, so every remaining column is a day_* column
result = df.groupby([x // days for x in range(df.shape[1])], axis=1).sum()
result.columns = [f'bi_daily_{n + 1}' for n in result.columns]
>>> result
bi_daily_1 bi_daily_2
id
1 3 5
2 0 0
3 3 0
4 5 1
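If you need the same aggregation at several window sizes (2, 7, 10 days), you could wrap the idea above in a small helper; the function name and prefix argument are illustrative, not from the answer:
import pandas as pd

def aggregate_days(df, days, prefix):
    # Sum consecutive blocks of `days` columns; assumes `id` is the index.
    # (On newer pandas, where groupby over axis=1 is deprecated,
    #  df.T.groupby(groups).sum().T is an equivalent spelling.)
    groups = [i // days for i in range(df.shape[1])]
    out = df.groupby(groups, axis=1).sum()
    out.columns = [f'{prefix}_{n + 1}' for n in out.columns]
    return out

bi_daily = aggregate_days(df, 2, 'bi_daily')     # 35 columns
weekly = aggregate_days(df, 7, 'weekly')         # 10 columns
ten_daily = aggregate_days(df, 10, 'ten_daily')  # 7 columns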
This could work, using a list comprehension: split the dataframe into pairs of two consecutive columns, use the iloc notation, sum each new dataframe, then concat to get a new dataframe.
day_1 day_2 day_3 day_4
0 1 2 4 1
1 0 0 0 0
2 0 3 0 0
3 3 2 1 0
(pd.concat([df.iloc[:,[i,i+1]]
.sum(axis=1)
for i in range(0,df.shape[1],2)],
axis=1)
.add_prefix('bi_daily_')
)
bi_daily_0 bi_daily_1
0 3 5
1 0 0
2 3 0
3 5 1
I have two dataframes with the same form:
> df1
Day ItemId Quantity
1 1 2
1 2 3
1 4 5
> df2
Day ItemId Quantity
1 1 0
1 2 0
1 3 0
1 4 0
I'd like to merge df1 and df2, and if a ['Day', 'ItemId'] pair exists in both df1 and df2, take the row from df1, which has the max Quantity.
I tried this command :
df = pd.concat([df1, df2]).groupby(level=0).max(df1['Quantity'],df2['Quantity'])
Group by both columns (passed as a list) and aggregate with max:
df = pd.concat([df1, df2]).groupby(['Day','ItemId'], as_index=False)['Quantity'].max()
print (df)
Day ItemId Quantity
0 1 1 2
1 1 2 3
2 1 3 0
3 1 4 5
If there can be multiple value columns:
df = (pd.concat([df1, df2])
.sort_values(['Day','ItemId','Quantity'], ascending=[True, True, False])
.drop_duplicates(['Day','ItemId']))
print (df)
Day ItemId Quantity
0 1 1 2
1 1 2 3
2 1 3 0
2 1 4 5
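If the requirement is literally "prefer the row from df1 whenever the pair exists in both" rather than taking the max, a sketch under that assumption: concatenate with df1 first and keep only the first occurrence of each pair.
# df1's rows come first, so keep='first' prefers them on duplicate pairs
df = (pd.concat([df1, df2], ignore_index=True)
        .drop_duplicates(['Day', 'ItemId'], keep='first'))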
I'm trying to find rows that have unique pairs of values across 2 columns, so this dataframe:
A B
1 0
2 0
3 0
0 1
2 1
3 1
0 2
1 2
3 2
0 3
1 3
2 3
will be reduced to only the rows that don't match up when flipped. For instance, 1 and 3 is a combination I only want returned once, so if the same pair exists with the columns flipped (3 and 1), it can be removed. The table I'm looking to get is:
A B
0 2
0 3
1 0
1 2
1 3
2 3
Where there is only one occurrence of each pair of values that are mirrored if the columns are flipped.
I think you can use apply sorted + drop_duplicates:
df = df.apply(sorted, axis=1).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Faster solution with numpy.sort:
df = pd.DataFrame(np.sort(df.values, axis=1), index=df.index,
                  columns=df.columns).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Solution without sorting with DataFrame.min and DataFrame.max:
a = df.min(axis=1)
b = df.max(axis=1)
df['A'] = a
df['B'] = b
df = df.drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Loading the data:
import numpy as np
import pandas as pd
a = np.array("1 2 3 0 2 3 0 1 3 0 1 2".split(), dtype=np.double)
b = np.array("0 0 0 1 1 1 2 2 2 3 3 3".split(), dtype=np.double)
df = pd.DataFrame(dict(A=a,B=b))
In case you don't need to sort the entire DF:
df["trans"] = df.apply(
lambda row: (min(row['A'], row['B']), max(row['A'], row['B'])), axis=1
)
df.drop_duplicates("trans")
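Another option, if you want to keep the surviving rows exactly as they appear and leave df untouched, is to build a boolean mask from the row-wise sorted values; this is a sketch along the same lines as the numpy.sort answer above:
import numpy as np
import pandas as pd

# Sort each row's pair, then mask out later occurrences of the same pair
mask = ~pd.DataFrame(np.sort(df[['A', 'B']].values, axis=1)).duplicated().values
result = df[mask]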