I have 2 dataframes. df1 is
DATE
2020-05-20
2020-05-21
and df2 is
ID NAME DATE
1 abc 2020-05-20
2 bcd 2020-05-20
3 ggg 2020-05-25
4 jhg 2020-05-26
I want to compare the values of df1 with df2. For example, take the first value of df1, i.e. 2020-05-20, find it in df2, filter the matching rows, and show that subset as the output.
My code is
for index, row in df1.iterrows():
    x = row['DATE']
    if x == df2['DATE']:
        print('Found')
        new = df2[df2['DATE'] == x]
        print(new)
    else:
        print('Not Found')
But I am getting the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
x == df2['DATE'] is a pd.Series of Booleans, not a single value, so it has to be reduced to a single Boolean before it can be evaluated in an if condition.
You can use either .any() or .all() depending on what you need; I assumed you need .any() here.
for index, row in df1.iterrows():
    x = row['DATE']
    if (x == df2['DATE']).any():
        print('Found')
        new = df2[df2['DATE'] == x]
        print(new)
    else:
        print('Not Found')
Also see here for a pure pandas solution for this.
You can create an extra column in df1 and use np.where to fill it.
import numpy as np
df1['Match'] = np.where(df1.DATE.isin(df2.DATE),'Found', 'Not Found')
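If you also need the matching rows of df2 themselves (the subset asked about in the question), the same isin test can be used as a boolean mask; a minimal sketch:
matched = df2[df2['DATE'].isin(df1['DATE'])]  # rows of df2 whose DATE appears in df1
print(matched)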
This can also be done as a merge, which I think is a bit clearer since it's a single line with no branching. You can also add the validate parameter to make sure that each key is unique in the left or right dataset.
import pandas
df1 = pandas.DataFrame(['2020-05-20', '2020-05-21'], columns=['DATE'])
df2 = pandas.DataFrame({'NAME': ['abc', 'bcd', 'ggg', 'jhg'],
                        'DATE': ['2020-05-20', '2020-05-20', '2020-05-25', '2020-05-26']})
df3 = df1.merge(right=df2, on='DATE', how='left')
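As a sketch of the validation mentioned above (df1's DATE values are unique here, so 'one_to_many' is the appropriate check; indicator=True is optional but marks the dates with no match as 'left_only'):
df3 = df1.merge(right=df2, on='DATE', how='left',
                validate='one_to_many', indicator=True)
print(df3)  # 2020-05-21 shows up with _merge == 'left_only'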
I would like to print each time inconsistency, where a row's start differs from the previous row's end, grouped by the 'id' column. In the following data, the last row would be such an inconsistency.
start,end,id
0,2,1
1,5,2
2,10,1
5,7,2
7,9,2
11,13,1
I have managed to do this using a for loop:
def check_consistency(df):
    grouped_df = df.groupby('id')
    for key, group in grouped_df:
        df = pd.DataFrame()
        df['start'] = group['start'].iloc[1:]
        df['end'] = group['end'].shift().iloc[1:]
        consistent = df['start'] == df['end']
        if not all(consistent):
            print(key)
            print(df[consistent == False])
Is there a way to achieve the same goal without using a for loop and creating an auxiliary DataFrame?
Edit: the following is the expected output.
DataFrame:
df = pd.DataFrame({'start': [0,1,2,5,7,11], 'end': [2,5,10,7,9,13], 'id': [1,2,1,2,2,1]})
Expected output:
1
start end
5 11 10.0
First, we sort by id. Then we build a mask comparing each start with the previous row's end, grouped by id.
Within each group, the first entry of the mask is defaulted to True, since it has no previous row and should not be selected for extraction.
Finally, we select those rows with mask being False (start not equal to previous row end) by using .loc with the negation of the boolean mask.
df1 = df.sort_values('id', kind='mergesort')  # mergesort for a stable sort, to maintain the sequence apart from the sort key
mask = (df1['start']
        .eq(df1['end'].shift())
        .groupby(df1['id'])
        .transform(lambda x: [True] + x.iloc[1:].tolist())
        )
df1.loc[~mask]
Output:
start end id
5 11 13 1
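An equivalent sketch without the transform lambda, assuming the rows are already ordered within each id (as they are in the example data):
prev_end = df.groupby('id')['end'].shift()              # previous row's end within each id
print(df[prev_end.notna() & df['start'].ne(prev_end)])  # rows whose start breaks the chain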
I have two dataframes. The first, df1, has a non-unique ID and a timestamp value in ms. The other, df2, has the non-unique ID, a separate unique ID, a start time and an end time (both in ms).
I need to get the correct unique ID for each row in df1 from df2. I would do this by...
match each non-unique ID in df1 to the relevant series of rows in df2
of those rows, find the one with the start and end range that contains the timestamp in df1
get the unique ID from the resulting row and copy it to a new column in df1
I don't think I can use pd.merge since I need to compare the df1 timestamp to two different columns in df2. I would think df.apply is my answer, but I can't figure it out.
Here is some dummy code:
df1_dict = {
    'nonunique_id': ['abc', 'def', 'ghi', 'jkl'],
    'timestamp': [164.3, 2071.2, 1001.7, 846.4]
}
df2_dict = {
    'nonunique_id': ['abc', 'abc', 'def', 'def', 'ghi', 'ghi', 'jkl', 'jkl'],
    'unique_id': ['a162c1', 'md85k', 'dk102', 'l394j', 'dj4n5', 's092k', 'dh567', '57ghed0'],
    'time_start': [160, 167, 2065, 2089, 1000, 1010, 840, 876],
    'time_end': [166, 170, 2088, 3000, 1009, 1023, 875, 880]
}
df1 = pd.DataFrame(data=df1_dict)
df2 = pd.DataFrame(data=df2_dict)
And here is a manual test...
df2['unique_id'][(df2['nonunique_id'].eq('abc')) & (df2['time_start']<=164.3) & (df2['time_end']>=164.3)]
...which returns the expected output (the relevant unique ID from df2):
0 a162c1
Name: unique_id, dtype: object
I'd like a function that can apply the above manual test automatically, and copy the results to a new column in df1.
I tried this...
def unique_id_fetcher(nonunique_id, timestamp):
    cond_1 = df2['nonunique_id'].eq(nonunique_id)
    cond_2 = df2['time_start'] <= timestamp
    cond_3 = df2['time_end'] >= timestamp
    unique_id = df2['unique_id'][(cond_1) & (cond_2) & (cond_3)]
    return unique_id

df1['unique_id'] = df1.apply(unique_id_fetcher(df1['nonunique_id'], df1['timestamp']))
...but that results in:
ValueError: Can only compare identically-labeled Series objects
(Edited for clarity)
IIUC,
you can do a cartesian product of both dataframes with a merge, then apply your logic.
Then you create a dict and map the values back onto your df1 using nonunique_id as the key.
df1['key'] = 'var'
df2['key'] = 'var'
df3 = pd.merge(df1,df2,on=['key','nonunique_id'],how='outer')
df4 = df3.loc[
    (df3["timestamp"] >= df3["time_start"]) & (df3["timestamp"] <= df3["time_end"])
]
d = dict(zip(df4['nonunique_id'],df4['unique_id']))
df1['unique_id'] = df1['nonunique_id'].map(d)
print(df1.drop('key',axis=1))
nonunique_id timestamp unique_id
0 abc 164.3 a162c1
1 def 2071.2 dk102
2 ghi 1001.7 dj4n5
3 jkl 846.4 dh567
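A slightly leaner sketch of the same idea, starting again from the original df1 and df2 (before the 'key' and 'unique_id' columns are added): since both frames share nonunique_id, merging on that key alone already pairs each df1 row with its candidate intervals, so the dummy 'key' column isn't strictly needed (and pandas >= 1.2 offers how='cross' if a true cross join is ever required).
merged = df1.merge(df2, on='nonunique_id', how='left')   # one row per (df1 row, candidate interval)
in_range = merged[(merged['timestamp'] >= merged['time_start'])
                  & (merged['timestamp'] <= merged['time_end'])]
df1['unique_id'] = df1['nonunique_id'].map(
    dict(zip(in_range['nonunique_id'], in_range['unique_id'])))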
I am starting to learn Pandas. I have seen a lot of questions here on SO where people ask how to delete a row if a column matches a certain value.
In my case it is the opposite. Imagine having this dataframe:
What I want to know is: if any column has the value salty in any of its rows, that column should be deleted, keeping only the columns that never contain it.
I have tried several variations of this:
if df.loc[df['A'] == 'salty']:
    df.drop(df.columns[0], axis=1, inplace=True)
But I am quite lost finding documentation on how to delete columns based on a row value in that column. That code is a mix of finding a specific column and always deleting the first column (my idea was to search for the value in each column inside a for loop).
Perform a comparison across your values, then use DataFrame.any to get a mask to index:
df.loc[:, ~(df == 'Salty').any()]
If you insist on using drop, this is how you need to do it: pass the labels of the matching columns.
df.drop(columns=df.columns[(df == 'Salty').any()])
df = pd.DataFrame({
    'A': ['Mountain', 'Salty'], 'B': ['Lake', 'Hotty'], 'C': ['River', 'Coldy']})
df
A B C
0 Mountain Lake River
1 Salty Hotty Coldy
(df == 'Salty').any()
A True
B False
C False
dtype: bool
df.loc[:, ~(df == 'Salty').any()]
B C
0 Lake River
1 Hotty Coldy
df.columns[(df == 'Salty').any()]
# Index(['A'], dtype='object')
df.drop(columns=df.columns[(df == 'Salty').any()])
B C
0 Lake River
1 Hotty Coldy
The following locates the indices where your desired column matches a specific value and then drops those rows. I think this is probably the more straightforward way of accomplishing this:
df.drop(df.loc[df['Your column name here'] == 'Match value'].index, inplace=True)
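For instance, applied to the example frame built above (note that this variant removes the matching rows rather than the columns):
df.drop(df.loc[df['A'] == 'Salty'].index, inplace=True)
print(df)   # only the row without 'Salty' remains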
Here's one possibility:
df = df.drop([col for col in df.columns if df[col].eq('Salty').any()], axis=1)
I'm trying to filter a dataframe based on the values within multiple columns, using a single condition, while keeping other columns to which I don't want to apply the filter at all.
I've reviewed these answers, with the third being the closest, but still no luck:
how do you filter pandas dataframes by multiple columns
Filtering multiple columns Pandas
Python Pandas - How to filter multiple columns by one value
Setup:
import pandas as pd
df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2],
    'a': ['A', 'A', 'A', 'A', 'NONE'],
    'b': ['B', 'B', 'B', 'B', 'B'],
    'c': ['C', 'C', 'C', 'NONE', 'NONE']
}, columns=['month', 'a', 'b', 'c'])
l = ['month','a','c']
df = df.loc[df['month'] == df['month'].max(), df.columns.isin(l)].reset_index(drop = True)
Current Output:
month a c
0 2 A NONE
1 2 NONE NONE
Desired Output:
month a
0 2 A
1 2 NONE
I've tried:
sub = l[1:]
df = df[(df.loc[:, sub] != 'NONE').any(axis = 1)]
and many other variations (.all(), [sub, :], ~df.loc[...], (axis = 0)), but all with no luck.
Basically I want to drop any column (within the sub list) that has all 'NONE' values in it.
Any help is much appreciated.
You first want to substitute your 'NONE' with np.nan so that it is recognized as a null value by dropna. Then use loc with your boolean series and column subset. Then use dropna with axis=1 and how='all'
import numpy as np

df.replace('NONE', np.nan) \
  .loc[df.month == df.month.max(), l].dropna(axis=1, how='all')
month a
3 2 A
4 2 NONE
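An alternative sketch that skips the NaN round-trip, assuming 'NONE' is always the literal string and using the original df and l from the setup: take the row/column subset first, then drop the columns that are entirely 'NONE'.
sub = df.loc[df['month'] == df['month'].max(), l]
sub = sub.loc[:, ~sub.eq('NONE').all()]   # drop columns where every value is 'NONE'
print(sub)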
I would like to replace the values in a pandas dataframe from another series based on the column names. I have the following dataframe:
Y2000 Y2001 Y2002 Y2003 Y2004 Item Item Code
34 43 0 0 25 Test Val
and I have another series:
Y2000 41403766
Y2001 45283735
Y2002 47850796
Y2003 38639101
Y2004 45226813
How do I replace the values in the first dataframe based on the values in the 2nd series?
--MORE EDITS:
To recreate the problem, the code and data are here: umd.box.com/s/hqd6oopj6vvp4qvpwnj8r4lm3z7as4i3
To run the code:
Replace data_dir in config_rotations.txt with the path to the input directory, i.e. where the files are kept
Replace out_dir in config_rotations.txt with whatever output path you want
Run python code\crop_stats.py. The problem is in line 133 of crop_stats.py
--EDIT:
Based on #Andy's query, here's the result I want:
Y2000 Y2001 Y2002 Y2003 Y2004 Item Item Code
41403766 45283735 47850796 38639101 45226813 Test Val
I tried
df_a.replace(df_b)
but this does not change any value in df_a
You can construct a df from the series after reshaping and overwrite the columns:
In [85]:
df1[s.index] = pd.DataFrame(columns = s.index, data = s.values.reshape(1,5))
df1
Out[85]:
Y2000 Y2001 Y2002 Y2003 Y2004 Item Item Code
0 41403766 45283735 47850796 38639101 45226813 Test Val
So this uses the series' index values to sub-select from the df and then constructs a df from the same series; here we have to reshape the array to make a single-row df.
EDIT
The reason the code above won't work on your real code is, firstly, that when assigning you can't do this:
df.loc[(df['Country Code'] == replace_cnt) & (df['Item'] == crop)][s.index]
This is called chained indexing and raises a warning; see the docs.
So to correct this you can put the columns inside the []:
df.loc[(df['Country Code'] == replace_cnt) & (df['Item'] == crop),s.index]
Additionally, pandas tries to align on index values and column names; if they don't match, you'll get NaN values. You can get around this by calling .values to get a NumPy array, which is just anonymous data with no index or column labels; as long as the data shape is broadcastable, it will do what you want:
df.loc[(df['Country Code'] == replace_cnt) & (df['Item'] == crop), s.index] = \
    pd.DataFrame(columns=s.index, data=s.values.reshape(1, len(s.index))).values