Map three rows of data into a matrix - python

I have a dataset of movie ratings that looks as follows:
I want to map this into a matrix where the index is the user id, the columns are the movie ids, and the values are the ratings.
What I have done so far is:
movies = df['movieId'].unique()
users = df['userId'].unique()
data_set = pd.DataFrame({'userId':users})
data_set = data_set.set_index('userId')
for movie in movies:
    data_set[movie] = 0
So now I need to fill those cells with the corresponding ratings, but this is a messy and slow process.

Consider the dataframe df
df = pd.DataFrame([
[1, 11, 1],
[1, 12, 5],
[2, 11, 3],
[2, 13, 4]
], columns=['userid', 'movieid', 'rating'])
option 1
pivot
df.pivot(index='userid', columns='movieid', values='rating')
option 2
set_index + unstack
df.set_index(['userid', 'movieid']).rating.unstack()
Both yield
movieid 11 12 13
userid
1 1.0 5.0 NaN
2 3.0 NaN 4.0
However, the unstack method has a fill_value parameter that lets you keep the integer dtype:
df.set_index(['userid', 'movieid']).rating.unstack(fill_value=0)
movieid 11 12 13
userid
1 1 5 0
2 3 0 4
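Both pivot and unstack raise an error if the same (userid, movieid) pair appears more than once. A minimal sketch of one way around that, assuming duplicate ratings should be averaged (that choice, the extra duplicate row, and the names ratings and matrix are illustrative, not part of the original answer):

```python
import pandas as pd

# Sample data with one duplicated (userid, movieid) pair: user 2 rated movie 13 twice.
ratings = pd.DataFrame(
    [[1, 11, 1], [1, 12, 5], [2, 11, 3], [2, 13, 4], [2, 13, 2]],
    columns=["userid", "movieid", "rating"],
)

# pivot_table aggregates duplicates (here with the mean) and fills
# user/movie combinations that have no rating with 0.
matrix = ratings.pivot_table(
    index="userid",
    columns="movieid",
    values="rating",
    aggfunc="mean",
    fill_value=0,
)
print(matrix)
```

If 0 is also a legitimate rating in your data, leaving fill_value out and keeping NaN for missing entries is probably the safer choice.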

Related

Pandas merging on multi columns while ignoring NaN

A similar question was asked here Pandas merge on multiple columns ignoring NaN but without answer, so I'll ask maybe someone can help.
I need to merge values from df2 into df1, but the key used in the merge differs between rows in df2, as the rows in df2 have NaNs in different columns, and in that case I want to ignore those columns, and use for each row only the columns that have values.
df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]], columns=['level1', 'level2', 'level3'])
df1
level1 level2 level3
0 0 1 3
1 0 2 4
2 1 2 5
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]], columns=['level1', 'level2', 'level3', 'value'])
df2
level1 level2 level3 value
0 0.0 NaN NaN 10
1 0.0 1.0 NaN 12
2 NaN 2.0 5.0 13
When I do df1.merge(df2, how='left'), I get df1 with NaN in the value column, since there is no match on all the level columns: pandas tries to match the NaN values as well.
What I do want is to get a match for any rows in df2 without trying to match the NaNs:
level1 level2 level3 value
0 0 1 3 10
1 0 1 3 12
2 0 2 4 10
3 1 2 5 13
Explanation:
Row 0 in df1 has a match on the non-NaN columns of rows 0 and 1 in df2, so it gets values 10 and 12 from there. Row 1 in df1 has a match on the non-NaN columns of row 0 in df2, so it gets value 10 from there. Row 2 in df1 has a match on the non-NaN columns of row 2 in df2, so it gets value 13 from there.
In the real data I actually have 6 level columns and the non-NaN columns for each row in df2 can be any combination or a single column from there.
What I do now is iterate over the rows of df2 with iterrows, create for each one a mini-dataframe of only the non-NaN columns, and merge df1 with it. But as we know, that is not really efficient, and I wonder if there is something better that can be done.
I think I figured out a vectorized solution.
Fundamentally, the idea is that you merge df1 with df2 three separate times for the number of levels and then concat the dataframes together into one.
From there, you count how many columns are null and subtract from the number of levels. This tells you how many duplicates (or matches) are required in order to prevent the data from being dropped later.
Then, you calculate how many matches or duplicates there actually are. If actual is the same as required, then that means the row is a match, and it gets kept in the dataframe.
It's not pretty, but to improve my answer you could create a merging function to cut some of the code. Most importantly, it should be highly performant compared to looping through every row. As a final note, for the duplicates_required helper column, you will need to change the 3 to a 6 since you have 6 columns in your actual dataset and you will obviously need to repeat some of my merging code:
df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]], columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]], columns=['level1', 'level2', 'level3', 'value'])
df2 = df2.assign(duplicates_required = 3 - df2.isnull().sum(axis=1))
df = pd.concat([
    df1.merge(df2[['level1','value', 'duplicates_required']], on='level1'),
    df1.merge(df2[['level2','value', 'duplicates_required']], on='level2'),
    df1.merge(df2[['level3','value', 'duplicates_required']], on='level3')
])
cols = ['level1', 'level2', 'level3', 'value']
df['actual_duplicates'] = df.groupby(cols)['value'].transform('size')
df = (df[df['duplicates_required'].eq(df['actual_duplicates'])]
      .drop_duplicates(subset=cols)
      .drop(['duplicates_required', 'actual_duplicates'], axis=1)
      .reset_index(drop=True))
df
Out[1]:
level1 level2 level3 value
0 0 1 3 10
1 0 1 3 12
2 0 2 4 10
3 1 2 5 13
I think this works better than my previous answer using regex. Similar process, but a bit simpler to understand.
Do a full merge of the two dataframes
Compare across levels and count number of mismatches
Filter to rows where mismatch count == 0
import pandas as pd
df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]], columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]], columns=['level1', 'level2', 'level3', 'value'])
levels_to_match = ['level1','level2','level3']
levels_to_match_df2 = [level + '_df2' for level in levels_to_match]
for df in [df1,df2]:
    df['temp'] = 1
df1 = df1.merge(df2, on='temp', suffixes=[None,'_df2']).drop(columns=['temp'])
df1['mismatch'] = df1.apply(lambda x:
    sum([(1 - (y == z or pd.isna(z))) for y, z in zip(list(x[levels_to_match]), list(x[levels_to_match_df2]))]),
    axis=1)
df1 = df1.loc[df1['mismatch'] == 0, :].drop(columns=['mismatch'] + levels_to_match_df2)
print(df1)
level1 level2 level3 value
0 0 1 3 10
1 0 1 3 12
3 0 2 4 10
8 1 2 5 13
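The row-wise apply above can probably be replaced by a single boolean mask over the cross-joined frame. A minimal sketch of that idea, assuming pandas 1.2 or later (for how="cross") and level columns that are numeric or NaN; the names pairs, left, right, ok and result are illustrative:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]],
                   columns=["level1", "level2", "level3"])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]],
                   columns=["level1", "level2", "level3", "value"])
levels = ["level1", "level2", "level3"]

# Every row of df1 paired with every row of df2; df2's level columns get a suffix.
pairs = df1.merge(df2, how="cross", suffixes=(None, "_df2"))

left = pairs[levels].to_numpy(dtype=float)
right = pairs[[c + "_df2" for c in levels]].to_numpy(dtype=float)

# A pair is kept when every level either matches or is NaN (a wildcard) in df2.
ok = ((left == right) | np.isnan(right)).all(axis=1)
result = pairs.loc[ok, levels + ["value"]].reset_index(drop=True)
print(result)
```

The cross join still materialises len(df1) * len(df2) rows, so this trades memory for speed and may not suit very large inputs.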
Old answer with regex
Probably not ideal, but maybe try converting your levels into strings and regex expressions, then do a full merge of all possible combinations, and finally filter using a regex search/match across two helper columns (one from df1 and the other from df2).
Assuming the data you're matching on are either int or NaN then this seems to work okay. If you have other data types in your real data then the string/regex transformations will need to be adjusted accordingly.
import pandas as pd
import re
df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]], columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]], columns=['level1', 'level2', 'level3', 'value'])
levels_to_match = ['level1','level2','level3']
for df in [df1,df2]:
    df['helper'] = df[levels_to_match].apply(list, axis=1)
    df['helper'] = df['helper'].apply(lambda x: ','.join([str(int(item)) if pd.notna(item) else '.*' for item in x]))
    df['temp'] = 1
df1 = df1.merge(df2.drop(columns=levels_to_match), on='temp', suffixes=[None,'_df2']).drop(columns=['temp'])
df1['match'] = df1.apply(lambda x: re.search(x['helper_df2'], x['helper']) is not None, axis=1)
df1 = df1.loc[df1['match'], :].drop(columns=['helper','helper_df2','match'])
print(df1)
level1 level2 level3 value
0 0 1 3 10
1 0 1 3 12
3 0 2 4 10
8 1 2 5 13

How to concat based on a condition in python

I would like to concat 2 Dataframes based on the same date to identify when a product is bought in a linear fashion.
Here's my code:
s = pd.Series(['01-2020', '02-2020', '03-2020', '04-2020', '05-2020', '06-2020', '07-2020', '08-2020', '09-2020', '10-2020', '11-2020', '12-2020'], name='Date')
data = [['01-2020', 5], ['02-2020', 3], ['03-2020', 1], ['05-2020', 4], ['06-2020', 8], ['08-2020', 3], ['09-2020', 11], ['10-2020', 5], ['12-2020', 3]]
df = pd.DataFrame(data, columns = ['Date Bought', 'Amount_Bought'])
result = pd.concat([df, s], axis=1, join="outer")
When I try to concat these dataframes the result is out of order.
I would like the output to look like this:
Date Date_Bought Amount_Bought
01-2020 01-2020 5
02-2020 02-2020 3
03-2020 03-2020 1
04-2020 NaN 0
05-2020 05-2020 4
06-2020 06-2020 8
07-2020 NaN 0
08-2020 08-2020 3
09-2020 09-2020 11
10-2020 10-2020 5
11-2020 NaN 0
12-2020 12-2020 3
Use merge instead of concat. With axis=1, concat aligns the series and the data frame on their index labels rather than on the date values, so the rows end up misaligned, which is not what you want. Then replace the NaN in the 'Amount_Bought' column with 0 using fillna.
results = pd.merge(left = s, right = df, left_on = 'Date', right_on = 'Date Bought', how = 'left')
results[['Amount_Bought']] = results[['Amount_Bought']].fillna(value=0)
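For reference, here is a self-contained sketch of the same left merge on the question's data. The names months and bought are illustrative, and converting the filled column back to integers is an assumption about the desired output:

```python
import pandas as pd

# All twelve months, in the order the final table should follow.
months = pd.Series([f"{m:02d}-2020" for m in range(1, 13)], name="Date")

bought = pd.DataFrame(
    [["01-2020", 5], ["02-2020", 3], ["03-2020", 1], ["05-2020", 4],
     ["06-2020", 8], ["08-2020", 3], ["09-2020", 11], ["10-2020", 5],
     ["12-2020", 3]],
    columns=["Date Bought", "Amount_Bought"],
)

# A left merge keeps every month, in the order of the left frame.
result = months.to_frame().merge(
    bought, left_on="Date", right_on="Date Bought", how="left"
)

# Months without a purchase get an amount of 0; the date column stays NaN.
result["Amount_Bought"] = result["Amount_Bought"].fillna(0).astype(int)
print(result)
```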

Iterate through two data frames and update a column of the first data frame with a column of the second data frame in pandas

I am converting a piece of code written in R to python. The following code is in R. df1 and df2 are the dataframes. id, case, feature, feature_value are column names. The code in R is
for(i in 1:dim(df1)[1]){
  temp = subset(df2,df2$id == df1$case[i],select = df1$feature[i])
  df1$feature_value[i] = temp[,df1$feature[i]]
}
My code in python is as follows.
for i in range(0,len(df1)):
    temp=np.where(df1['case'].iloc[i]==df2['id']),df1['feature'].iloc[i]
    df1['feature_value'].iloc[i]=temp[:,df1['feature'].iloc[i]]
but it gives
TypeError: tuple indices must be integers or slices, not tuple
How to rectify this error? Appreciate any help.
Unfortunately, R and Pandas handle dataframes pretty differently. If you'll be using Pandas a lot, it would probably be worth going through a tutorial on it.
I'm not too familiar with R so this is what I think you want to do:
Find rows in df1 where the 'case' matches an 'id' in df2. If such a row is found, add the "feature" in df1 to a new df1 column called "feature_value."
If so, you can do this with the following:
#create a sample df1 and df2
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
>>> df1
case feature
0 1 3
1 2 4
2 3 5
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})
>>> df2
id age
0 1 45
1 3 63
2 7 39
#create a list with all the "id" values of df2
>>> df2_list = df2['id'].to_list()
>>> df2_list
[1, 3, 7]
#lambda allows small functions; in this case, the value of df1['feature_value']
#for each row is assigned df1['feature'] if df1['case'] is in df2_list,
#and otherwise it is assigned np.nan.
>>> df1['feature_value'] = df1.apply(lambda x: x['feature'] if x['case'] in df2_list else np.nan, axis=1)
>>> df1
case feature feature_value
0 1 3 3.0
1 2 4 NaN
2 3 5 5.0
Instead of a lambda, a full function can be created, which may be easier to understand:
def get_feature_values(df, id_list):
    if df['case'] in id_list:
        feature_value = df['feature']
    else:
        feature_value = np.nan
    return feature_value
df1['feature_value'] = df1.apply(get_feature_values, id_list=df2_list, axis=1)
Another way of going about this would involve merging df1 and df2 to find rows where the "case" value in df1 matches an "id" value in df2 (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
===================
To address the follow-up question in the comments:
You can do this by merging the dataframes and then creating a function.
#create example dataframes
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5], 'names': ['a', 'b', 'c']})
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39], 'a': [30, 31, 32], 'b': [40, 41, 42], 'c': [50, 51, 52]})
#merge the dataframes
>>> df1 = df1.merge(df2, how='left', left_on='case', right_on='id')
>>> df1
case feature names id age a b c
0 1 3 a 1.0 45.0 30.0 40.0 50.0
1 2 4 b NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0
Then you can create the following function:
def get_feature_values_2(df):
    if pd.notnull(df['id']):
        feature_value = df['feature']
        column_of_interest = df['names']
        feature_extended_value = df[column_of_interest]
    else:
        feature_value = np.nan
        feature_extended_value = np.nan
    return feature_value, feature_extended_value
# "result_type='expand'" allows multiple values to be returned from the function
df1[['feature_value', 'feature_extended_value']] = df1.apply(get_feature_values_2, result_type='expand', axis=1)
#This results in the following dataframe:
case feature names id age a b c feature_value \
0 1 3 a 1.0 45.0 30.0 40.0 50.0 3.0
1 2 4 b NaN NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0 5.0
feature_extended_value
0 30.0
1 NaN
2 51.0
#To keep only a subset of the columns:
#First create a copy-pasteable list of the column names
list(df1.columns)
['case', 'feature', 'names', 'id', 'age', 'a', 'b', 'c', 'feature_value', 'feature_extended_value']
#Choose the subset of columns you would like to keep
df1 = df1[['case', 'feature', 'names', 'feature_value', 'feature_extended_value']]
df1
case feature names feature_value feature_extended_value
0 1 3 a 3.0 30.0
1 2 4 b NaN NaN
2 3 5 c 5.0 51.0
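If the goal is what the R loop appears to do (for each row of df1, look up the df2 row whose id equals case and read the column named in feature), the per-row function can likely be replaced by a merge plus NumPy indexing. A minimal sketch, assuming ids in df2 are unique and that feature holds df2 column names; the sample values and the names value_cols, merged, col_pos and row_pos are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 'feature' names the df2 column to read for each case.
df1 = pd.DataFrame({"case": [1, 2, 3], "feature": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [1, 3, 7],
                    "a": [30, 31, 32], "b": [40, 41, 42], "c": [50, 51, 52]})
value_cols = ["a", "b", "c"]

# One row per df1 row (ids assumed unique); unmatched cases get NaN everywhere.
merged = df1.merge(df2, how="left", left_on="case", right_on="id")

# Position of each requested column within value_cols.
# Note: a name missing from value_cols would yield -1 and should be checked for.
col_pos = pd.Index(value_cols).get_indexer(merged["feature"].to_numpy())
row_pos = np.arange(len(merged))

# Pick one cell per row: (row i, column requested by row i).
df1["feature_value"] = merged[value_cols].to_numpy()[row_pos, col_pos]
print(df1)
```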

Match rows between dataframes and preserve order

I work in python and pandas.
Let's suppose that I have a dataframe like that (INPUT):
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
I want to process it to finally get a new dataframe which looks like that (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
To manage this I do the following:
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1
df_2['B'] -= 1
df_2['C'] = np.nan
df_2 looks like that for now:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
Now I want to do a matching/merging between df_1 and df_2 with using as keys the columns A and B.
I tried with isin() to do this:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
df_2.iloc[df_temp.index] = df_temp
but it gives me back the same df_2 as before without matching the common row 5 1 1 for A, B, C respectively:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
How can I do this properly?
By the way, just to be clear, the matching should not be done like
1st row of df1 - 1st row of df2
2nd row of df1 - 2nd row of df2
3rd row of df1 - 3rd row of df2
...
But it has to be done as:
any row of df1 - any row of df2
based on the specified columns as keys.
I think this is why isin() in my code above does not work: it does the filtering/matching in the former way.
On the other hand, .merge() can do the matching in the latter way but it does not preserve the order of the rows in the way I want and it is pretty tricky or inefficient to fix that.
Finally, keep in mind that with my actual dataframes way more than only 2 columns (e.g. 15) will be used as keys for the matching so it is better that you come up with something concise even for bigger dataframes.
P.S.
See my answer below.
Here's my suggestion using a lambda function in apply. Should be easily scalable to more columns to compare (just adjust cols_to_compare accordingly). By the way, when generating df_2, be sure to copy df_1, otherwise changes in df_2 will carry over to df_1 as well.
So generating the data first:
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1.copy() # Be sure to create a copy here
df_2['B'] -= 1
df_2['C'] = np.nan
and now we 'scan' df_1 for the rows of interest:
cols_to_compare = ['A', 'B']
df_2['C'] = df_2.apply(lambda x: 1 if any((df_1.loc[:, cols_to_compare].values[:]==x[cols_to_compare].values).all(1)) else np.nan, axis=1)
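What it does is check whether the values in the current row of df_2 also occur together, in the columns of interest, in any row of df_1. Note that it writes the constant 1 for a match rather than the matched row's C value; the two coincide here only because the matching row of df_1 has C equal to 1.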
The output is:
A B C
0 2 7 NaN
1 5 1 1.0
2 3 3 NaN
3 5 0 NaN
Someone (I do not remember his username) suggested the following (which I think works) and then he deleted his post for some reason (??!):
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
You can accomplish this using two for loops:
for idx_2, row_2 in df_2.iterrows():
    for _, row_1 in df_1.iterrows():
        if [row_2['A'], row_2['B']] == [row_1['A'], row_1['B']]:
            # .loc writes to df_2 directly instead of relying on chained assignment
            df_2.loc[idx_2, 'C'] = row_1['C']
Just replace this line:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
with:
df_1[df_1['A'].isin(df_2['A']) & df_1['B'].isin(df_2['B'])]
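It picks out the common row in this example (with df_2 created as a copy of df_1), but be aware that the two isin checks are evaluated per column, so a row of df_1 can pass even when its (A, B) pair never occurs together in a single row of df_2.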
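On the ordering concern raised in the question: a left merge keeps the key order of the left frame, so merging df_2's key columns against df_1 may already be enough. A minimal sketch, assuming the key combinations in df_1 are unique (duplicates would multiply rows); keys and result are illustrative names:

```python
import pandas as pd

df_1 = pd.DataFrame({"A": [2, 5, 3, 5], "B": [8, 2, 4, 1], "C": [6, 5, 9, 1]})
df_2 = df_1.copy()   # independent copy, so df_1 is left untouched
df_2["B"] -= 1

keys = ["A", "B"]    # extend this list for more key columns

# how="left" keeps the rows of df_2 in their original order and
# fills C with NaN where no row of df_1 has the same key combination.
result = df_2[keys].merge(df_1, on=keys, how="left")
print(result)
```

Only keys needs to change to match on more columns, for example the 15 mentioned in the question.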

Find the minimum value of a column greater than another column value in Python Pandas

I'm working in Python. I have two dataframes df1 and df2:
d1 = {'timestamp1': [88148 , 5617900, 5622548, 5645748, 6603950, 6666502], 'col01': [1, 2, 3, 4, 5, 6]}
df1 = pd.DataFrame(d1)
d2 = {'timestamp2': [5629500, 5643050, 6578800, 6583150, 6611350], 'col02': [7, 8, 9, 10, 11], 'col03': [0, 1, 0, 0, 1]}
df2 = pd.DataFrame(d2)
I want to create a new column in df1 with the value of the minimum timestamp of df2 greater than the current df1 timestamp, where df2['col03'] is zero. This is the way I did it:
df1['colnew'] = np.nan
TSs = df1['timestamp1']
for TS in TSs:
    values = df2['timestamp2'][(df2['timestamp2'] > TS) & (df2['col03']==0)]
    if not values.empty:
        df1.loc[df1['timestamp1'] == TS, 'colnew'] = values.iloc[0]
It works, but I'd prefer not to use a for loop. Is there a better way to do this?
Use pandas.merge_asof with a forward direction
pd.merge_asof(
    df1, df2.loc[df2.col03 == 0, ['timestamp2']],
    left_on='timestamp1', right_on='timestamp2', direction='forward'
).rename(columns=dict(timestamp2='colnew'))
col01 timestamp1 colnew
0 1 88148 5629500.0
1 2 5617900 5629500.0
2 3 5622548 5629500.0
3 4 5645748 6578800.0
4 5 6603950 NaN
5 6 6666502 NaN
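One detail worth checking: merge_asof matches equal keys by default, whereas the question asks for a timestamp strictly greater than timestamp1. A sketch of the same call with allow_exact_matches=False, with both inputs explicitly sorted because merge_asof requires sorted keys; candidates and out are illustrative names:

```python
import pandas as pd

df1 = pd.DataFrame({"timestamp1": [88148, 5617900, 5622548, 5645748, 6603950, 6666502],
                    "col01": [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({"timestamp2": [5629500, 5643050, 6578800, 6583150, 6611350],
                    "col02": [7, 8, 9, 10, 11],
                    "col03": [0, 1, 0, 0, 1]})

# Only rows with col03 == 0 are eligible; merge_asof needs sorted keys.
candidates = df2.loc[df2["col03"] == 0, ["timestamp2"]].sort_values("timestamp2")

out = pd.merge_asof(
    df1.sort_values("timestamp1"),
    candidates,
    left_on="timestamp1",
    right_on="timestamp2",
    direction="forward",          # look for the next timestamp2 at or after timestamp1
    allow_exact_matches=False,    # exclude equality, i.e. strictly greater
).rename(columns={"timestamp2": "colnew"})
print(out)
```

With the sample data no timestamps coincide, so the result is the same as with the default setting.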
Give the apply method a try.
def func(x):
    values = df2['timestamp2'][(df2['timestamp2'] > x) & (df2['col03']==0)]
    if not values.empty:
        return values.iloc[0]
    else:
        return np.nan
df1["timestamp1"].apply(func)
You can create a separate function to do what has to be done.
The output is your new column
0 5629500.0
1 5629500.0
2 5629500.0
3 6578800.0
4 NaN
5 NaN
Name: timestamp1, dtype: float64
It is not a one-line solution, but it helps keep things organised.
