How to search the index from another dataframe by passing dataframe columns - Python

I have two dataframes df1 and df2.
df1
index  emp_id   name  code
0          07  emp07   'A'
1          11  emp11   'B'
2          30  emp30   'C'
df2
index  emp_id  salary
0          06    1000
1          17    2000
2          11    3000
I want to store a map from df1['emp_id'] to df2.index.
Example: input array - ['emp11','B'] (from df1)
Expected output: [11, 2] # this is df1['emp_id'], df2.index
Code I am trying:
columns_to_idx = {emp_id: i for i, emp_id in
                  enumerate(list(DF1.set_index('emp_id').loc[DF2.index][['name', 'code']]))}

I think you need DataFrame.merge with an inner join, plus DataFrame.reset_index to turn the index into a column so it is not lost:
df = df1.merge(df2.reset_index(), on='emp_id')
print (df)
  emp_id   name code  index  salary
0     11  emp11    B      2    3000
Then it is possible to create a MultiIndex and select by tuple:
df2 = (df1.merge(df2.reset_index(), on='emp_id')
          .set_index(['name','code'])[['emp_id','index']])
print (df2)
            emp_id  index
name  code
emp11 B         11      2
print (df2.loc[('emp11','B')].tolist())
[11, 2]
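If you need the lookup for every matched row rather than a single tuple, a minimal sketch along the same lines (the frames below are reconstructed from the samples above; typing emp_id as strings to keep the leading zeros is an assumption):

import pandas as pd

df1 = pd.DataFrame({'emp_id': ['07', '11', '30'],
                    'name': ['emp07', 'emp11', 'emp30'],
                    'code': ['A', 'B', 'C']})
df2 = pd.DataFrame({'emp_id': ['06', '17', '11'],
                    'salary': [1000, 2000, 3000]})

# Inner merge keeps only emp_ids present in both frames;
# reset_index() preserves df2's original index as the 'index' column.
merged = df1.merge(df2.reset_index(), on='emp_id')

# Build a dict mapping (name, code) -> [emp_id, df2 index]
mapping = {(n, c): [e, i]
           for n, c, e, i in zip(merged['name'], merged['code'],
                                 merged['emp_id'], merged['index'])}
print(mapping[('emp11', 'B')])  # ['11', 2]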

Related

How can I get the index values in DF1 where DF1's column values match DF2's custom MultiIndex values?

I have two data frames: DF1 and DF2.
DF2 is essentially a randomly generated subset of rows in DF1.
I want to get the (integer) indexes of the rows in DF1 where all column values match a row of DF2.
I'm trying to do this with a multi-index:
So if I have the following:
DF1:
Index   Name  Age  Gender  Label
0       Kate   24       F      1
1       Bill   23       M      0
2        Bob   22       M      0
3      Billy   21       M      0
DF2:
MultiIndex     Name  Age  Gender  Label
(Bob,22,M)      Bob   22       M      0
(Billy,21,M)  Billy   21       M      0
Desired Output: [2,3]
How can I use that MultiIndex in DF2 to check DF1 for those matches?
I found this while searching but I think this requires you to specify what value you want beforehand? I can't find this exact use case.
df2.loc[(df2.index.get_level_values('Name') == 'xxx') &
        (df2.index.get_level_values('Age') == x) &
        (df2.index.get_level_values('Gender') == x)]
Please let me know the best way.
Thanks!
Edit (Code to generate df1):
Pseudocode: Merge two dataframes to get a total of 10 columns and drop everything except 4 columns.
Edit (Code to generate df2):
if amount_needed - len(lowest_value_keys) > 0:
    extra_samples = df1[df1.Label == 0].sample(n=amount_needed - len(lowest_value_keys), replace=False)
    lowest_value_df = pd.DataFrame(data=lowest_value_keys, columns=['Name', 'Age', 'Gender'])
    samples = pd.concat([lowest_value_df, extra_samples])
    samples.index = pd.MultiIndex.from_frame(samples[['Name', 'Age', 'Gender']])
else:
    all_samples = pd.DataFrame(data=lowest_value_keys, columns=['Name', 'Age', 'Gender'])
    samples = all_samples.sample(n=amount_needed, replace=False)
    samples.index = pd.MultiIndex.from_frame(samples[['Name', 'Age', 'Gender']])
Not sure if this answers your query, but what if we first reset the index of df1 to get it as another column 'Index', then set_index on Name, Age, Gender to find the matches against df2's index, and just take the resulting Index column? Would that work?
So that would be:
(df1.reset_index()
    .set_index(['Name', 'Age', 'Gender'])
    .loc[df2.set_index(['Name', 'Age', 'Gender']).index]['Index']
    .values)
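A self-contained sketch of that answer on the question's sample data (the frames are reconstructed from the post; note that reset_index() names the new column 'index', lowercase, when the original index is unnamed):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Kate', 'Bill', 'Bob', 'Billy'],
                    'Age': [24, 23, 22, 21],
                    'Gender': ['F', 'M', 'M', 'M'],
                    'Label': [1, 0, 0, 0]})

# DF2: a subset of DF1 rows keyed by a (Name, Age, Gender) MultiIndex
df2 = df1.iloc[[2, 3]].copy()
df2.index = pd.MultiIndex.from_frame(df2[['Name', 'Age', 'Gender']])

matches = (df1.reset_index()
              .set_index(['Name', 'Age', 'Gender'])
              .loc[df2.index]['index']
              .values)
print(matches)  # [2 3]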

Match multiple columns on Python to a single value

I hope you are doing well.
I am trying to perform a match based on multiple columns, where the values of Column B of df1 are scattered across three to four columns in df2. The goal is to return the value of Column A of df2 if a value of Column B matches any value in columns C, D, E.
What I did until now was multiple left merges (changing the name of Column B each time to match the names of columns C, D, E of df2).
I am trying to simplify the process but I am unsure how to do this.
My dataset looks like that:
Df1:
     ID
0    77
1  4859
2   LSP
DF2:
          X   id1  id2   id3
0  AAAAA_XX   889   77   BSP
1  BBBBB_XX  4859   CC  998P
2   CCCC_YY   YUI  TYU   LSP
My goal is to have in df1:
     ID         X
0    77  AAAAA_XX
1  4859  BBBBB_XX
2   LSP   CCCC_YY
Thank you very much !
You can get all the values from the id columns into one column first with pd.concat, then merge the tables like this:
df3 = pd.concat([df2.id1, df2.id2, df2.id3]).reset_index()
df1 = df2.merge(df3, how="left", left_on=df1.ID, right_on=df3[0])
df1 = df1.iloc[:, :2]
df1 = df1.rename(columns={"key_0": "ID"})
Not the most beautiful code in the world, but it works.
output:
     ID         X
0    77  AAAAA_XX
1  4859  BBBBB_XX
2   LSP   CCCC_YY
Use DataFrame.merge with DataFrame.melt:
df = df1.merge(df2.melt(id_vars='X', value_name='ID').drop('variable', axis=1),
               how='left',
               on='ID')
print (df)
     ID         X
0    77  AAAAA_XX
1  4859  BBBBB_XX
2   LSP   CCCC_YY
If duplicated ID values are possible, use:
df = (df1.merge(df2.melt(id_vars='X', value_name='ID')
                   .drop('variable', axis=1)
                   .drop_duplicates('ID'),
                how='left',
                on='ID'))
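For reference, a runnable sketch of the melt approach on the sample data (typing every id as a string is an assumption; the merge keys must share a dtype for matches to be found):

import pandas as pd

df1 = pd.DataFrame({'ID': ['77', '4859', 'LSP']})
df2 = pd.DataFrame({'X': ['AAAAA_XX', 'BBBBB_XX', 'CCCC_YY'],
                    'id1': ['889', '4859', 'YUI'],
                    'id2': ['77', 'CC', 'TYU'],
                    'id3': ['BSP', '998P', 'LSP']})

# melt stacks id1/id2/id3 into a single 'ID' column, keeping X alongside
long_ids = df2.melt(id_vars='X', value_name='ID').drop('variable', axis=1)
out = df1.merge(long_ids, how='left', on='ID')
print(out)
#      ID         X
# 0    77  AAAAA_XX
# 1  4859  BBBBB_XX
# 2   LSP   CCCC_YY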

Does condition selection preserve order in Pandas DataFrame?

For example,
df = pandas.DataFrame({'name':['a','b','c'], 'age':[10,20,30]})
  name  age
0    a   10
1    b   20
2    c   30
df[df['age'] > 10]
  name  age
1    b   20
2    c   30
My question is: Does Pandas make sure the index order is preserved?
Is it possible that the result looks like:
  name  age
2    c   30
1    b   20
Thanks
Yes, filtering preserves the order of rows (and their index values).
If you need to change the ordering, sort by column age:
df1 = df[df['age'] > 10].sort_values('age', ascending=False)
print (df1)
  name  age
2    c   30
1    b   20
It preserves the data order; it doesn't sort the data by any attribute automatically.
Here you can see that:
df = pd.DataFrame({'name':['a','b','c'], 'age':[30,20,10]}, index=[1,0,2])
df[df['age'] > 10]
#    age name
# 1   30    a
# 0   20    b

pandas - select last n rows of dataframe with respect to an attribute

My dataframe looks as below:
id, date, target
1,2016-10-24,22
1,2016-10-25,31
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-21,22
2,2016-10-22,31
2,2016-10-25,44
2,2016-10-27,12
Given the dataframe above, I want to select the last 2 rows of each id to make df2, and another df1 with the rest.
df1
id, date, target
1,2016-10-24,22
1,2016-10-25,31
2,2016-10-21,22
2,2016-10-22,31
df2
id, date, target
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-25,44
2,2016-10-27,12
How can I do this?
Thanks in advance.
You can use GroupBy.tail to create df2, then take the difference between the original index and df2's index and select those rows from df with loc to get df1:
df2 = df.groupby('id').tail(2)
print (df2)
   id        date  target
2   1  2016-10-27      44
3   1  2016-10-28      12
6   2  2016-10-25      44
7   2  2016-10-27      12
print (df.index.difference(df2.index))
Int64Index([0, 1, 4, 5], dtype='int64')
df1 = df.loc[df.index.difference(df2.index)]
print (df1)
   id        date  target
0   1  2016-10-24      22
1   1  2016-10-25      31
4   2  2016-10-21      22
5   2  2016-10-22      31
You can use df.groupby('id').tail(2): http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.tail.html
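Putting it together, a minimal end-to-end sketch; df.drop(df2.index) is an equivalent way to keep the remaining rows (reading the sample data from an inline CSV is just for self-containment):

import pandas as pd
from io import StringIO

data = """id,date,target
1,2016-10-24,22
1,2016-10-25,31
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-21,22
2,2016-10-22,31
2,2016-10-25,44
2,2016-10-27,12"""
df = pd.read_csv(StringIO(data), parse_dates=['date'])

df2 = df.groupby('id').tail(2)  # last 2 rows per id
df1 = df.drop(df2.index)        # everything else
print(df1)
print(df2)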

Join two pandas data frames with the indices of the first?

I have two dataframes, df1:
  column1 column2
0       A       B
1       A       A
2       C       A
3    None    None
4    None    None
and df2
            id  l
40   100005090  A
188  100020985  B
Now I want to join df1 and df2, but I don't know how to match the indices. If I simply do df1.join(df2), the join aligns on index labels, so df2's row 40 doesn't pair with df1's row 0. How do I tell pandas to align the frames by position instead, so that the first entry of df2 (index 40) is joined with the first entry of df1? That is, I would like to get:
            id  l column1 column2
40   100005090  A       A       B
188  100020985  B       A       A
...
You can take a slice of df1 that is the same length as df2, overwrite its index values with df2's, and then join:
In [174]:
sub = df1.iloc[:len(df2)]
sub.index = df2.index
df2.join(sub)
Out[174]:
            id  l column1 column2
40   100005090  A       A       B
188  100020985  B       A       A
If the dfs are the same length then the first line is not needed; you just overwrite the index with the index values from the other df.
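A quick runnable sketch of that answer on the question's data (the frames are reconstructed from the post; .copy() avoids mutating a view of df1):

import pandas as pd

df1 = pd.DataFrame({'column1': ['A', 'A', 'C', None, None],
                    'column2': ['B', 'A', 'A', None, None]})
df2 = pd.DataFrame({'id': [100005090, 100020985],
                    'l': ['A', 'B']},
                   index=[40, 188])

sub = df1.iloc[:len(df2)].copy()  # first len(df2) rows of df1
sub.index = df2.index             # adopt df2's index so labels line up
print(df2.join(sub))
#             id  l column1 column2
# 40   100005090  A       A       B
# 188  100020985  B       A       A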
