Apologies if this is a fairly newbie question. I was trying to find which rows are common between two data frames. The return values should be the row indexes of df2 that are common with df1. My clunky example:
import pandas as pd

df1 = pd.DataFrame({'col1': ['cx','cx','cx2'], 'col2': [1,4,12]})
df1['col2'] = df1['col2'].map(str)
df2 = pd.DataFrame({'col1': ['cx','cx','cx','cx','cx2','cx2'], 'col2': [1,3,5,10,12,12]})
df2['col2'] = df2['col2'].map(str)
df1['idx'] = df1[['col1','col2']].apply(lambda x: '_'.join(x), axis=1)
df2['idx'] = df2[['col1','col2']].apply(lambda x: '_'.join(x), axis=1)
df1['idx_values'] = df1.index.values
df2['idx_values'] = df2.index.values
df3 = pd.merge(df1, df2, on='idx')
myindexes = df3['idx_values_y']
myindexes.to_csv(idir + 'test.txt', sep='\t', index=False)
The return values should be [0,4,5]. It would be great to have this done efficiently since the two dataframes would have several million rows.
A new column with joined values is not necessary: merge performs an inner join on both columns by default, and if you need the values of df2.index, add reset_index:
df1 = pd.DataFrame({'col1':['cx','cx','cx2'], 'col2':[1,4,12]})
df2 = pd.DataFrame({'col1':['cx','cx','cx','cx','cx2','cx2'], 'col2':[1,3,5,10,12,12]})
df3 = pd.merge(df1,df2.reset_index(), on = ['col1','col2'])
print (df3)
col1 col2 index
0 cx 1 0
1 cx2 12 4
2 cx2 12 5
If you need both indexes:
df4 = pd.merge(df1.reset_index(),df2.reset_index(), on = ['col1','col2'])
print (df4)
index_x col1 col2 index_y
0 0 cx 1 0
1 2 cx2 12 4
2 2 cx2 12 5
For only the intersection of both DataFrames:
df5 = pd.merge(df1,df2, on = ['col1','col2'])
#if the DataFrames contain only these 2 columns, the keys can be omitted:
#df5 = pd.merge(df1,df2)
print (df5)
col1 col2
0 cx 1
1 cx2 12
2 cx2 12
This can easily be done by merging (inner join) both dataframes on the idx key column built above:
common_rows = pd.merge(df1, df2, how='inner', on=['idx'])['idx_values_y']
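For reference, a minimal self-contained sketch of the same idea that skips the helper column entirely and merges on the key columns directly; it prints the expected [0, 4, 5]:
import pandas as pd

df1 = pd.DataFrame({'col1': ['cx','cx','cx2'], 'col2': ['1','4','12']})
df2 = pd.DataFrame({'col1': ['cx','cx','cx','cx','cx2','cx2'],
                    'col2': ['1','3','5','10','12','12']})

# reset_index exposes df2's row positions as a regular 'index' column
common = pd.merge(df1, df2.reset_index(), how='inner', on=['col1','col2'])
print(common['index'].tolist())   # [0, 4, 5]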
I have two DataFrames with the same number of columns but different numbers of rows.
df1
col1 col2
0 a 1,2,3,4
1 b 1,2,3
2 c 1
df2
col1 col2
0 b 1,3
1 c 1,2
2 d 1,2,3
3 e 1,2
df1 is the existing list, df2 is the updated list. The expected result is whatever is in df2 that was not previously in df1.
Expected result:
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
I've tried with
mask = df1['col2'] != df2['col2']
but it doesn't work because the DataFrames have different numbers of rows.
Use DataFrame.explode on the values split from column col2, then use DataFrame.merge with a right join and the indicator parameter, filter only the right_only rows with boolean indexing, and finally aggregate back with join:
df11 = df1.assign(col2 = df1['col2'].str.split(',')).explode('col2')
df22 = df2.assign(col2 = df2['col2'].str.split(',')).explode('col2')
df = df11.merge(df22, indicator=True, how='right', on=['col1','col2'])
df = (df[df['_merge'].eq('right_only')]
.groupby('col1')['col2']
.agg(','.join)
.reset_index(name='col2'))
print (df)
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
I have a df with several columns; based on the value (1-6) in these columns, I want to assign a value (0-1) to each one's corresponding column. I can do it on a column-by-column basis but would like to make it a single function. Below is some example code:
import pandas as pd
df = pd.DataFrame({'col1': [1,3,6,3,5,2], 'col2': [4,5,6,6,1,3], 'col3': [3,6,5,1,1,6],
'colA': [0,0,0,0,0,0], 'colB': [0,0,0,0,0,0], 'colC': [0,0,0,0,0,0]})
(col1 corresponds with colA, col2 with colB, col3 with colC)
This code works on a column-by-column basis:
df.loc[(df.col1 != 1) & (df.col1 < 6), 'colA'] = (df['colA']+ 1)
But I would like to be able to have a list of columns, so to speak, and have it correspond with another list. Something like this (but that actually works):
m = df['col1' : 'col3'] != 1 & df['col1' : 'col3'] < 6
df.loc[m, 'colA' : 'colC'] += 1
Thank You!
The idea is to select both column groups with DataFrame.loc, build a boolean mask from the first group, rename the mask's columns to match the second group via a mapping dict, and finally use DataFrame.add on only those columns:
df1 = df.loc[:, 'col1' : 'col3']
df2 = df.loc[:, 'colA' : 'colC']
d = dict(zip(df1.columns,df2.columns))
df1 = ((df1 != 1) & (df1 < 6)).rename(columns=d)
df[df2.columns] = df[df2.columns].add(df1)
print (df)
col1 col2 col3 colA colB colC
0 1 4 3 0 1 1
1 3 5 6 1 1 0
2 6 6 5 0 0 1
3 3 6 1 1 0 0
4 5 1 1 1 0 0
5 2 3 6 1 1 0
Here's what I would do:
# split up dataframe
sub_df = df.iloc[:,:3]
abc = df.iloc[:,3:]
# make numpy array truth table
truth_table = (sub_df.to_numpy() > 1) & (sub_df.to_numpy() < 6)
# redefine abc based on numpy truth table
new_abc = pd.DataFrame(truth_table.astype(int), columns=['colA', 'colB', 'colC'])
# join the updated dataframe subgroups
new_df = pd.concat([sub_df, new_abc], axis=1)
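For the example df from the question, printing the result shows it matches the previous answer's colA-colC values; note this only coincides because colA-colC start at zero, since new_abc replaces those columns rather than incrementing them:
print(new_df)
col1 col2 col3 colA colB colC
0 1 4 3 0 1 1
1 3 5 6 1 1 0
2 6 6 5 0 0 1
3 3 6 1 1 0 0
4 5 1 1 1 0 0
5 2 3 6 1 1 0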
I have several dataframes with the same name. Each dataframe has one row and two columns, and one column is common to all of them. I would like to left-join them together. I have no plan to give them distinct names, as there are so many of them; I am only showing a few here. Is there any way I can left-join them and generate the desired output below?
Here are the dataframes:
col1 col2_4
0 1 2
col1 col2_9
0 1 10
col1 col2_1
0 1 12
col1 col2_3
0 1 5
Output:
col1 col2_4 col2_9 col2_1 col2_3
0 1 2 10 12 5
Code:
group = df.groupby([randomcolumnname])
for name, groups in group:
#do some stuff for groups
print(groups)
#I want to join the groups dataframes after this line (some of these group dataframes are shown above)
Thanks in advance!
I believe you need a left-join merge over a list of DataFrames on column col1:
dfs = [df1, df2, df3, df4]
from functools import reduce
df = reduce(lambda left, right: pd.merge(left, right, on='col1', how='left'), dfs)
print (df)
col1 col2_4 col2_9 col2_1 col2_3
0 1 2 10 12 5
Or, for an outer join, create an index with set_index and use concat:
df = pd.concat([x.set_index('col1') for x in dfs], axis=1).reset_index()
print (df)
col1 col2_4 col2_9 col2_1 col2_3
0 1 2 10 12 5
EDIT:
I think it is better to use a custom function with GroupBy.apply:
def func(x):
print (x)
#do some stuff for groups
return x
group = df.groupby([randomcolumnname]).apply(func)
If that is not possible, collect a list of DataFrames:
dfs = []
group = df.groupby([randomcolumnname])
for name, groups in group:
#do some stuff for groups
print(groups)
dfs.append(groups)
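The collected dfs can then be combined the same way as above; a minimal sketch, assuming every group frame shares the col1 key:
from functools import reduce
# left-join the collected one-row group frames on the shared key
df_final = reduce(lambda left, right: pd.merge(left, right, on='col1', how='left'), dfs)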
I would like to reshape my dataframe:
from Input_DF
col1 col2 col3
0\nCourse_66 0\nCourse_67 1\nCourse_68 a c
0\nCourse_66 1\nCourse_67 0\nCourse_68 a d
to Output_DF
Course_66 Course_67 Course_68 col2 col3
0 0 1 a c
0 1 0 a d
Please note that col1 contains one long string. Any help would be much appreciated.
Use:
#first split by whitespace into a DataFrame
df1 = df['col1'].str.split(expand=True)
#for each column, split by \n and take the first value (the number)
df2 = df1.apply(lambda x: x.str.split(r'\\n').str[0])
#for the column names, take the first row and select the second split value
df2.columns = df1.iloc[0].str.split(r'\\n').str[1]
print (df2)
0 Course_66 Course_67 Course_68
0 0 0 1
1 0 1 0
#join to original, remove unnecessary column
df = df2.join(df.drop('col1', axis=1))
print (df)
Course_66 Course_67 Course_68 col2 col3
0 0 0 1 a c
1 0 1 0 a d
Another solution with list comprehension:
L = [[y.split('\\n')[0] for y in x.split()] for x in df['col1']]
cols = [x.split('\\n')[1] for x in df.loc[0, 'col1'].split()]
df1 = pd.DataFrame(L, index=df.index, columns=cols)
print (df1)
Course_66 Course_67 Course_68
0 0 0 1
1 0 1 0
EDIT:
#split values by whitespace - this splits on \n too
df1 = df['course_vector'].str.split(expand=True)
#select every second column (the values)
df2 = df1.iloc[:, 1::2]
#for the column names, take every other value from the first row
df2.columns = df1.iloc[0, 0::2]
#join to original
df = df2.join(df.drop('course_vector', axis=1))
Since your data are ordered as value, key pairs, you can split on newlines and multiple spaces with a regex to get a list, then take every other value (starting at the first position for values and the second position for labels) and return a Series object. By applying this function, you get back a DataFrame from these multiple Series, which you can then combine with the original DataFrame.
import pandas as pd
df = pd.DataFrame({'col1': ['0\nCourse_66 0\nCourse_67 1\nCourse_68',
'0\nCourse_66 1\nCourse_67 0\nCourse_68'],
'col2': ['a', 'a'], 'col3': ['c', 'd']})
def to_multiple_columns(str_list):
# take the numeric values for each series and column labels and return as a series
# by taking every other value
return pd.Series(str_list[::2], str_list[1::2])
# split on newlines and spaces
splits = df['col1'].str.split(r'\n|\s+').apply(to_multiple_columns)
output = pd.concat([splits, df.drop('col1', axis=1)], axis=1)
print(output)
Output:
Course_66 Course_67 Course_68 col2 col3
0 0 0 1 a c
1 0 1 0 a d
I have two data frames: one is user-item-rating and the other is side information about the items:
#df1
A12VH45Q3H5R5I B000NWJTKW 5.0
A3J8AQWNNI3WSN B000NWJTKW 4.0
A1XOBWIL4MILVM BDASK99000 1.0
#df2
B000NWJTKW ....
BDASK99000 ....
Now I'd like to map the item and user names to integer IDs. I know factorize can do this:
df.apply(lambda x: pd.factorize(x)[0] + 1)
But I'd like to ensure that the integer IDs of the items are consistent across the two data frames, so that the resulting data frames are:
#df1
1 1 5.0
2 1 4.0
3 2 1.0
#df2
1 ...
2 ...
Do you know how to ensure that? Thanks in advance!
Concatenate the common column(s), and apply pd.factorize (or pd.Categorical) on that:
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
For example,
import pandas as pd
df1 = pd.DataFrame(
[('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0),
('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating'])
df2 = pd.DataFrame(
[('B000NWJTKW', 10),
('BDASK99000', 20)], columns=['item', 'extra'])
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
codes, uniques = pd.factorize(df1['user'])
df1['user'] = codes + 1
print(df1)
print(df2)
yields
# df1
user item rating
0 1 1 5
1 2 1 4
2 3 2 1
# df2
item extra
0 1 10
1 2 20
Another way to work around the problem (if you have enough memory) would be to merge the two DataFrames with df3 = pd.merge(df1, df2, on='item', how='outer'), and then factorize df3['item']:
df3 = pd.merge(df1, df2, on='item', how='outer')
for col in ['item', 'user']:
df3[col] = pd.factorize(df3[col])[0] + 1
print(df3)
yields
user item rating extra
0 1 1 5 10
1 2 1 4 10
2 3 2 1 20
Another option could be to apply factorize on the first dataframe, and then apply the resulting mapping to the second dataframe:
# create factorization:
idx, levels = pd.factorize(df1['item'])
# replace the item codes in the first dataframe with the new index value
df1['item'] = idx
# create a dictionary mapping the original code to the new index value
d = {code: i for i, code in enumerate(levels)}
# apply this mapping to the second dataframe
df2['item'] = df2.item.apply(lambda code: d[code])
This approach will only work if every level is present in both dataframes.
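If df2 might contain items that never appear in df1, one possible workaround (an assumption beyond the original answer: unseen items are given a -1 sentinel code) is to look codes up with dict.get:
# map unseen item codes to -1 instead of raising a KeyError
df2['item'] = df2['item'].map(lambda code: d.get(code, -1))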