pandas dataframe boolean indexing with multiple conditions from another df - python

I'm trying to identify the rows shared between two DataFrames, i.e. rows whose values in certain columns match in the SAME row.
Example:
import pandas as pd
df = pd.DataFrame([{'energy': 'power', 'id': '123'}, {'energy': 'gas', 'id': '456'}])
df2 = pd.DataFrame([{'energy': 'power', 'id': '456'}, {'energy': 'power', 'id': '123'}])
df =
  energy   id
0  power  123
1    gas  456
df2 =
  energy   id
0  power  456
1  power  123
Therefore, I'm trying to get the rows from df where energy and id match exactly in the same row of df2.
If I do it like this, I get an incorrect result:
df2.loc[(df2['energy'].isin(df['energy'])) & (df2['id'].isin(df['id']))]
because the two isin checks are independent, so this matches both rows of df2, whereas I would expect only power / 123 to be matched.
How can I do boolean indexing with multiple "dynamic" conditions based on another DataFrame's rows, matching the values on the same rows of the other DataFrame?
Hope it's clear.

pd.merge(df, df2, on=['id','energy'], how='inner')
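An inner merge on both columns keeps only the rows where energy and id match in the same row, so with the sample data it returns:

  energy   id
0  power  123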

Related

Search in pandas Dataframe column value

I have the following data as a pandas DataFrame:
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'first_name': ['Sheldon', 'Raj', 'Leonard', 'Howard', 'Amy'],
    'last_name': ['Copper', 'Koothrappali', 'Hofstadter', 'Wolowitz', 'Fowler'],
    'movie_ids': ['34,265,268,65',
                  '34,43,65,61',
                  '5,876,8',
                  '14,5,268',
                  '134,845,2']}).set_index(['id'], drop=False)
and a list of ids:
movie_ids = ['34','845']
I would like to get the indexes of the rows where any item of movie_ids appears in the movie_ids column.
I tried converting the column value to a list and then filtering on that, but that way I only get the matched values:
result = list(filter(lambda x: set(movie_ids).intersection(set(x.split(","))), df['movie_ids'].values))
then using loc to get only those rows:
df = df.loc[df['movie_ids'].isin(result)]
But I guess this is not the most efficient way, for example with millions of rows.
You can use a regex search on the string column; wrapping the alternation in a non-capturing group makes the word boundaries apply to every id, not just the first and last:
df[df.movie_ids.str.contains(rf"\b(?:{'|'.join(movie_ids)})\b")]
    id first_name     last_name      movie_ids
id
1    1    Sheldon        Copper  34,265,268,65
2    2        Raj  Koothrappali    34,43,65,61
5    5        Amy        Fowler      134,845,2
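If the regex ever becomes a bottleneck at millions of rows, a set-based membership test is a possible alternative (a sketch, not benchmarked against the regex):

wanted = set(movie_ids)
mask = df['movie_ids'].apply(lambda s: not wanted.isdisjoint(s.split(',')))
df.index[mask]  # or df[mask] for the full rows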

Get a value from dataframe with different shape based on two columns

I have two DataFrames, df1 and df2, color-coded by approximately matching marks (the original post showed them as screenshots; the data is in the code below).
The "marks" are not the same in each of them, but some are close.
How can I copy the "Evaluated" value from df2 to df1 based on relevant "name" and "mark"?
My code is:
df1 = pd.DataFrame({'Name': ['Lisa', 'Lisa', 'Lisa', 'Hann', 'Hann', 'Hann'],
                    'Marks': [25.123, 26.425, 27.456, 25.789, 26.124, 26.225],
                    'Evaluated': ['', '', '', '', '', '']})
df2 = pd.DataFrame({'Name': ['Lisa', 'Lisa', 'Lisa', 'Lisa', 'Hann', 'Hann', 'Hann', 'Hann'],
                    'Marks': [25.125, 26.422, 27.451, 27.465, 25.786, 25.796, 26.121, 26.227],
                    'Evaluated': [0, 0, 1, 1, 1, 1, 1, 1]})
df3 = pd.merge(df1.round(2),
               df2.round(2),
               how='left',
               on=['Name', 'Marks'])
The expected result is df3 (also shown as a screenshot in the original post).
How can I do an approximate match and get the value of the last column?
I tried to use df.loc and df.where, but they didn't work because the tables have different shapes. What I expect is something similar to Excel's VLOOKUP function with approximate matching enabled.
My code also changes the Marks values in the result, which I would love to keep as they were in df1. I could probably keep a copy from before, but I believe there is a more Pythonic way to solve this than merging rounded tables.
Thanks in advance!
You can try pandas.merge_asof:
df1 = df1.sort_values(['Marks'])
df2 = df2.sort_values(['Marks'])
df3 = pd.merge_asof(df1[['Name', 'Marks']],
                    df2,
                    on='Marks',
                    direction='nearest',
                    by='Name')
print(df3)
   Name   Marks  Evaluated
0  Lisa  25.123          0
1  Hann  25.789          1
2  Hann  26.124          1
3  Hann  26.225          1
4  Lisa  26.425          0
5  Lisa  27.456          1
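merge_asof also accepts a tolerance argument if you want to cap how far apart the marks may be; rows with no match inside the window get NaN for the df2 columns (the 0.01 window here is an arbitrary choice for illustration):

df3 = pd.merge_asof(df1[['Name', 'Marks']].sort_values('Marks'),
                    df2.sort_values('Marks'),
                    on='Marks',
                    direction='nearest',
                    by='Name',
                    tolerance=0.01)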

Groupby, apply function and combine results in dataframe

I would like to group the rows by the Type column and apply a function to each group that returns its first row where the Value column is not NaN, copying that row into a separate DataFrame.
I got the following so far:
dummy data:
df1 = {'Date': ['04.12.1998', '05.12.1998', '06.12.1998', '04.12.1998', '05.12.1998', '06.12.1998'],
       'Type': [1, 1, 1, 2, 2, 2],
       'Value': ['NaN', 100, 120, 'NaN', 'NaN', 20]}
df2 = pd.DataFrame(df1, columns=['Date', 'Type', 'Value'])
print(df2)
         Date  Type Value
0  04.12.1998     1   NaN
1  05.12.1998     1   100
2  06.12.1998     1   120
3  04.12.1998     2   NaN
4  05.12.1998     2   NaN
5  06.12.1998     2    20
import pandas as pd

selectedStockDates = {'Date': [], 'Type': [], 'Values': []}
selectedStockDates = pd.DataFrame(selectedStockDates, columns=['Date', 'Type', 'Values'])
first_valid_index = df2[['Value']].first_valid_index()
selectedStockDates.loc[df2.index[first_valid_index]] = df2.iloc[first_valid_index]
The code above should work for the first id, but I am struggling to apply this to all ids in the data frame. Does anyone know how to do this?
Let's mask the rows of the DataFrame where the Value column is NaN, then group on Type and aggregate using first:
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
df2.mask(df2['Value'].isna()).groupby('Type', as_index=False).first()
   Type        Date  Value
0   1.0  05.12.1998  100.0
1   2.0  06.12.1998   20.0
Just use groupby and first, but you need to make sure that your null values are np.nan and not strings like they are in your sample data:
df2.groupby('Type')['Value'].first()
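With the pd.to_numeric conversion from the previous answer applied first, this returns:

Type
1    100.0
2     20.0
Name: Value, dtype: float64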

Filtering and grouping rows in one DataFrame, by another DataFrame

I have two DataFrames. I want to iterate through the rows of df1, filter the rows of df2 with the same id, and collect the column "B" values into a new column of df1.
data = {'id': [1, 2, 3]}
df1 = pd.DataFrame(data)
data = {'id': [1, 1, 3, 3, 3], 'B': ['ab', 'bc', 'ad', 'ds', 'sd']}
df2 = pd.DataFrame(data)
df1 - id (15k rows)
df2 - id, B (50M rows)
Desired output
data = {'id': [1,2,3],'B':['[ab,bc]','[]','[ad,ds,sd]']}
pd.DataFrame(data)
def func(row):
    temp3 = df2.merge(pd.DataFrame(data=[row.values] * len(row), columns=row.index),
                      how='right', on=['id'])
    temp1 = temp3.B.values
    return temp1

df1['B'] = df1.apply(func, axis=1)
I am using merge for filtering and applying the function to each row of df1. The code takes an hour to execute on the large DataFrame. How can I make this run faster?
Are you looking for a simple filter and grouped listification?
df2[df2['id'].isin(df1['id'])].groupby('id', as_index=False)[['B']].agg(list)
   id             B
0   1      [ab, bc]
1   3  [ad, ds, sd]
Note that grouping as lists is considered suboptimal in terms of performance.
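Note also that the desired output lists id 2 with an empty list even though it never occurs in df2; one way to get that shape (a sketch along the same lines, not benchmarked) is to map the grouped lists back onto df1:

lists = df2.groupby('id')['B'].agg(list)
df1['B'] = df1['id'].map(lists).apply(lambda x: x if isinstance(x, list) else [])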

Pandas: How can I iterate a for loop over 2 different data-frames?

I am trying to calculate fuzz ratios for multiple rows in 2 data frames:
df1:
id name
1 Ab Cd E
2 X.Y!Z
3 fgh I
df2:
name_2
abcde
xyz
I want to calculate the fuzz ratio between every value in df1.name and every value in df2.name_2. To do that I have this code:
for i in df1['name']:
    for r in df2['name_2']:
        print(fuzz.ratio(i, r))
But I want the final result to have the ids from df1 as well. It would ideally look like this:
final_df:
id name name_2 score
1 Ab Cd E abcde 50
1 Ab Cd E xyz 0
2 X.Y!Z abcde 0
2 X.Y!Z xyz 60
3 fgh I abcde 0
3 fgh I xyz 0
Thanks for the help!
You can solve your problem like this:
Create an empty DataFrame:
final = pandas.DataFrame({'id': [], 'name': [], 'name_2': [], 'score': []})
Iterate through the two DataFrames, inserting the id, the names, and the score, and concatenating onto the final DataFrame:
for id, name in zip(df1['id'], df1['name']):
    for name2 in df2['name_2']:
        tmp = pandas.DataFrame({'id': [id], 'name': [name], 'name_2': [name2],
                                'score': [fuzz.ratio(name, name2)]})
        final = pandas.concat([final, tmp], ignore_index=True)
print(final)
There is probably a cleaner and more efficient way to do this, but I hope this helps.
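Concatenating inside the loop copies the growing DataFrame on every iteration; collecting the rows in a plain list and building the DataFrame once at the end is usually much faster (a sketch of the same double loop):

rows = []
for id_, name in zip(df1['id'], df1['name']):
    for name2 in df2['name_2']:
        rows.append({'id': id_, 'name': name, 'name_2': name2,
                     'score': fuzz.ratio(name, name2)})
final = pandas.DataFrame(rows)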
I don't fully understand the application of lambda functions in pd.apply, but after some SO searching, I think this is a reasonable solution.
import pandas as pd
from fuzzywuzzy import fuzz
d = [{'id': 1, 'name': 'Ab Cd e'}, {'id': 2, 'name': 'X.Y!Z'}, {'id': 3, 'name': 'fgh I'}]
df1 = pd.DataFrame(d)
df2 = pd.DataFrame({'name_2': ['abcde', 'xyz']})
This is a cross join in pandas; a tmp column is required (see: pandas cross join no columns in common).
df1['tmp'] = 1
df2['tmp'] = 1
df = pd.merge(df1, df2, on=['tmp'])
df = df.drop('tmp', axis=1)
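On pandas 1.2 or newer the tmp column is unnecessary, since merge supports cross joins directly:

df = pd.merge(df1, df2, how='cross')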
You can .apply the function fuzz.ratio to columns in the df (see: Pandas: How to use apply function to multiple columns).
df['fuzz_ratio'] = df.apply(lambda row: fuzz.ratio(row['name'], row['name_2']), axis = 1)
df
I also tried setting an index on df1, but that resulted in its exclusion from the cross-joined df.
