I have two dataframes keyed by approximately matching marks:
df1:
   Name   Marks Evaluated
0  Lisa  25.123
1  Lisa  26.425
2  Lisa  27.456
3  Hann  25.789
4  Hann  26.124
5  Hann  26.225
df2:
   Name   Marks  Evaluated
0  Lisa  25.125          0
1  Lisa  26.422          0
2  Lisa  27.451          1
3  Lisa  27.465          1
4  Hann  25.786          1
5  Hann  25.796          1
6  Hann  26.121          1
7  Hann  26.227          1
The "Marks" are not the same in each of them, but some are close.
How can I copy the "Evaluated" value from df2 to df1 based on the matching "Name" and the closest "Marks"?
My code is:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Lisa', 'Lisa', 'Lisa', 'Hann', 'Hann', 'Hann'],
                    'Marks': [25.123, 26.425, 27.456, 25.789, 26.124, 26.225],
                    'Evaluated': ['', '', '', '', '', '']})
df2 = pd.DataFrame({'Name': ['Lisa', 'Lisa', 'Lisa', 'Lisa', 'Hann', 'Hann', 'Hann', 'Hann'],
                    'Marks': [25.125, 26.422, 27.451, 27.465, 25.786, 25.796, 26.121, 26.227],
                    'Evaluated': [0, 0, 1, 1, 1, 1, 1, 1]})
df3 = pd.merge(df1.round(2),
               df2.round(2),
               how='left',
               on=['Name', 'Marks'])
The expected result is df3:

   Name   Marks  Evaluated
0  Lisa  25.123          0
1  Lisa  26.425          0
2  Lisa  27.456          1
3  Hann  25.789          1
4  Hann  26.124          1
5  Hann  26.225          1

How can I do an approximate match and pull in the value of the last column?
I tried to use df.loc and df.where, but they didn't work because the tables have different shapes. What I am looking for is a function similar to Excel's VLOOKUP with approximate matching enabled.
My code also changes the "Marks" values at the end (because of the rounding), and I would like to keep them as they were in df1. I could work on a copy of the original, but I believe there is a more pythonic way to solve this than merging the tables.
Thanks in advance!
You can try pandas.merge_asof (note that both inputs must be sorted on the merge key):

df1 = df1.sort_values(['Marks'])
df2 = df2.sort_values(['Marks'])
df3 = pd.merge_asof(df1[['Name', 'Marks']],
                    df2,
                    on='Marks',
                    direction='nearest',
                    by='Name')
print(df3)
   Name   Marks  Evaluated
0  Lisa  25.123          0
1  Hann  25.789          1
2  Hann  26.124          1
3  Hann  26.225          1
4  Lisa  26.425          0
5  Lisa  27.456          1
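If you also want to reject matches that are too far apart, merge_asof accepts a tolerance parameter; rows with no match within the tolerance get NaN in the joined columns. A minimal sketch, where the 0.05 threshold is an assumption for illustration, not a value from the question:

# tolerance bounds the allowed gap between the Marks values;
# 0.05 is an assumed threshold for illustration
df3 = pd.merge_asof(df1[['Name', 'Marks']],
                    df2,
                    on='Marks',
                    direction='nearest',
                    by='Name',
                    tolerance=0.05)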
Related
I'm trying to identify the rows of two dataframes that share the same values in certain columns, within the SAME row.
Example:
import pandas as pd
df = pd.DataFrame([{'energy': 'power', 'id': '123'}, {'energy': 'gas', 'id': '456'}])
df2 = pd.DataFrame([{'energy': 'power', 'id': '456'}, {'energy': 'power', 'id': '123'}])
df =
  energy   id
0  power  123
1    gas  456

df2 =
  energy   id
0  power  456
1  power  123
Therefore, I'm trying to get the rows from df where energy and id match exactly in the same row of df2.
If I do it like this, I get a wrong result:

df2.loc[(df2['energy'].isin(df['energy'])) & (df2['id'].isin(df['id']))]

because this matches both rows of df2, whereas I would expect only power / 123 to be matched.
How should I do boolean indexing with multiple "dynamic" conditions based on another df's rows, matching the values in the same rows of the other df?
Hope it's clear.
pd.merge(df, df2, on=['id','energy'], how='inner')
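A quick check with the example frames confirms that only the row matching on both columns survives:

import pandas as pd

df = pd.DataFrame([{'energy': 'power', 'id': '123'}, {'energy': 'gas', 'id': '456'}])
df2 = pd.DataFrame([{'energy': 'power', 'id': '456'}, {'energy': 'power', 'id': '123'}])

# an inner merge keeps only rows whose (id, energy) pair appears in both frames
print(pd.merge(df, df2, on=['id', 'energy'], how='inner'))
#   energy   id
# 0  power  123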
I have two DFs. I want to go through the rows of DF1, filter all the rows in DF2 with the same id, and collect the column "B" values as a list in a new column of DF1.

data = {'id': [1, 2, 3]}
df1 = pd.DataFrame(data)

data = {'id': [1, 1, 3, 3, 3], 'B': ['ab', 'bc', 'ad', 'ds', 'sd']}
df2 = pd.DataFrame(data)

The real data is large: DF1 has id (15k rows), DF2 has id and B (50M rows).
Desired output:

data = {'id': [1, 2, 3], 'B': [['ab', 'bc'], [], ['ad', 'ds', 'sd']]}
pd.DataFrame(data)
def func(row):
    # build a one-row frame from the df1 row and right-merge it to filter df2
    temp3 = df2.merge(pd.DataFrame(data=[row.values] * len(row), columns=row.index),
                      how='right', on=['id'])
    return temp3.B.values

df1['B'] = df1.apply(func, axis=1)

I am using merge for filtering and applying the function to df1 row by row. The code takes an hour to execute on the large data frame. How can I make it run faster?
Are you looking for a simple filter and grouped listification?
df2[df2['id'].isin(df1['id'])].groupby('id', as_index=False)[['B']].agg(list)
   id             B
0   1      [ab, bc]
1   3  [ad, ds, sd]
Note that grouping as lists is considered suboptimal in terms of performance.
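Note also that ids which never appear in df2 (like id 2 in the desired output) drop out of the grouped result. If you need them back with an empty list, you can merge the grouped result onto df1; a sketch building on the answer above, not part of the original answer:

grouped = (df2[df2['id'].isin(df1['id'])]
           .groupby('id', as_index=False)[['B']]
           .agg(list))
out = df1.merge(grouped, on='id', how='left')
# ids missing from df2 come back as NaN; replace those with empty lists
out['B'] = out['B'].apply(lambda x: x if isinstance(x, list) else [])
print(out)
#    id             B
# 0   1      [ab, bc]
# 1   2            []
# 2   3  [ad, ds, sd]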
I am new to dataframe manipulation. I've been playing around with df.merge, df.join, and pd.concat, and I've been getting frequent errors while being unable to merge without duplicates.
I have two representative dataframes I want to merge:

df1 = pd.DataFrame({'1990': 1, '1991': 2, '1992': 3}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'1989': 0, '1990': 1, '1991': 2, '1992': 3, '1993': 4}, index=['d'])

I want to merge them on the intersection of the columns of the two dataframes while appending the row at the same time. Is there a way to use a dataframe method to do this?
The final product should look like:

   1990  1991  1992
a     1     2     3
b     1     2     3
c     1     2     3
d     1     2     3
Use concat with inner join:

df = pd.concat([df1, df2], join='inner')
print(df)

   1990  1991  1992
a     1     2     3
b     1     2     3
c     1     2     3
d     1     2     3
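For comparison, the default join='outer' keeps the union of the columns instead, filling the cells that have no data with NaN:

df = pd.concat([df1, df2])  # join='outer' is the default
# columns 1989 and 1993 now appear, with NaN for rows a, b, c
# (the exact column order can vary with the pandas version and the sort flag)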
I have two dataframes:

df1:
   A   B    C
0  1  ss  123
1  2  sv  234
2  3  sc  333

df2:
   A  dd   xc
0  1  ss  123

df2 will always have a single row. How can I check whether there is a match for that row of df2 in df1?
Using NumPy comparison with np.all and parameter axis=1 for rows:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['ss', 'sv', 'sc'], 'C': [123, 234, 333]})
df2 = pd.DataFrame({'A': [1], 'dd': ['ss'], 'xc': [123]})

df3 = df1.loc[np.all(df1.values == df2.values, axis=1), :]

Or, comparing only the columns of interest:

df3 = df1.loc[np.all(df1[['B', 'C']].values == df2[['dd', 'xc']].values, axis=1), :]

print(df3)

   A   B    C
0  1  ss  123
In addition to Sandeep's answer, you can do:

df1[np.all(df1.values == df2.values, 1)].any().any()

to get a boolean.
Or another way:

df1[(df2.values == df1.values).all(1)].any().any()

Or:

pd.merge(df1, df2).equals(df1)

Note: the first two expressions output True.
To check a specific column (same as Sandeep's):

df1[col].isin(df2[col]).any()
How to check whether there is a match for that row of df2, in df1?
You can align columns and then check equality of df1 with the only row of df2:
df2.columns = df1.columns
res = (df1 == df2.iloc[0]).all(1).any() # True
The benefit of this solution is you aren't subsetting df1 (expensive), but instead constructing a Boolean dataframe / array (cheap) and checking if all values in at least one row are True.
This is still not particularly efficient as you are considering every row in df1 rather than stopping when a condition is satisfied. With numeric data, in particular, there are more efficient solutions.
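Putting it together with the example frames from the question, a minimal sketch of the approach described above:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['ss', 'sv', 'sc'], 'C': [123, 234, 333]})
df2 = pd.DataFrame({'A': [1], 'dd': ['ss'], 'xc': [123]})

df2.columns = df1.columns            # align the column labels
mask = (df1 == df2.iloc[0]).all(1)   # row-wise equality: [True, False, False]
print(mask.any())                    # True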
I would like to merge two DataFrames, and keep the index from the first frame as the index on the merged dataset. However, when I do the merge, the resulting DataFrame has integer index. How can I specify that I want to keep the index from the left data frame?
In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3},
   ...:                   'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})

In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3},
   ...:                   'to_merge_on': {0: 1, 1: 3, 2: 5}})

In [6]: a
Out[6]:
   col1  to_merge_on
a     1            1
b     2            3
c     3            4

In [7]: b
Out[7]:
   col2  to_merge_on
0     1            1
1     2            3
2     3            5

In [8]: a.merge(b, how='left')
Out[8]:
   col1  to_merge_on  col2
0     1            1   1.0
1     2            3   2.0
2     3            4   NaN

In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')
EDIT: Switched to example code that can be easily reproduced
In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
       col1  to_merge_on  col2
index
a         1            1   1.0
b         2            3   2.0
c         3            4   NaN
Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.
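For example, one possible way to de-duplicate on the restored index, assuming you want to keep the first match per original row:

merged = a.reset_index().merge(b, how="left").set_index('index')
merged = merged[~merged.index.duplicated(keep='first')]  # keep the first match per row of a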
You can make a copy of the index on the left dataframe, do the merge, and then restore the index from that column:

a['copy_index'] = a.index
merged = a.merge(b, how='left').set_index('copy_index')
merged.index.name = None  # drop the helper name so the index looks as before
I found this simple method very useful while working with large dataframes and pd.merge_asof() (or dd.merge_asof()).
This approach is superior when resetting the index is expensive (large dataframes).
There is a non-pd.merge solution using Series.map and DataFrame.set_index:

a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2'])

   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN

This doesn't introduce a dummy index name for the index.
Note, however, that Series.map works on a single column at a time, so this approach does not extend directly to merging on multiple columns.
df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)

This preserves the index of df1. Note that it joins on the index rather than on a column.
Assuming that the resulting df has the same number of rows and order as your first df, you can do this:
c = pd.merge(a, b, on='to_merge_on')
c.set_index(a.index, inplace=True)
Another simple option is to rename the index to what it was before:

a.merge(b, how="left").set_axis(a.index)

merge preserves the row order of dataframe 'a' but resets the index, so it is safe to use set_axis here.
You can also use the DataFrame.join() method to achieve the same thing. The join method preserves the original index. The column to join on can be specified with the on parameter:

In [17]: a.join(b.set_index("to_merge_on"), on="to_merge_on")
Out[17]:
   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN
I think I've come up with a different solution. I was joining the left table on its index and the right table on a column value based on the index of the left table. What I did was a normal merge:

First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')

Then I retrieved the new index numbers from the merged table and put them in a new column named Sentiment Line Number:

First10ReviewsJoined['Sentiment Line Number'] = First10ReviewsJoined.index.tolist()

Then I manually set the index back to the original left-table index, based on the pre-existing column called Line Number (the column value I had joined on from the left table's index):

First10ReviewsJoined.set_index('Line Number', inplace=True)

Then I removed the index name Line Number so that it remains blank:

First10ReviewsJoined.index.name = None

Maybe a bit of a hack, but it seems to work well and is relatively simple. It also reduces the risk of duplicates or messed-up data. Hopefully that all makes sense.
For people who want to keep the left index exactly as it was before a left join:

import pandas

def left_join(
    a: pandas.DataFrame,
    b: pandas.DataFrame,
    on: list[str],
    b_columns: list[str] = None,
) -> pandas.DataFrame:
    if b_columns:
        # keep only the join keys plus the requested columns (deduplicated,
        # as a list -- indexing with a set raises a TypeError in pandas)
        b = b[list(dict.fromkeys(on + b_columns))]
    df = (
        a.reset_index()
        .merge(b, how="left", on=on)
        # reset_index names an unnamed single index "index"
        .set_index(keys=[x or "index" for x in a.index.names])
    )
    df.index.names = a.index.names
    return df
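Illustrative usage with the a and b frames from earlier in this thread, assuming the function above:

res = left_join(a, b, on=['to_merge_on'], b_columns=['col2'])
print(res)
#    col1  to_merge_on  col2
# a     1            1   1.0
# b     2            3   2.0
# c     3            4   NaN
print(res.index.equals(a.index))  # True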