Keep rows in dataframe where values are present in another dataframe - python

I have two dataframes, df1 and df2:
df1:
  Transport      City  Color
0       Car     Paris    red
1       Car    London  white
2      Bike     Paris    red
3       Car  New York   blue
df2:
  Color
0   red
1  blue
2  blue
They are not the same length.
I want to make a new dataframe based on the first one, keeping a row only if its color is also present in the second dataframe, so that the output would be:
  Transport      City Color
0       Car     Paris   red
1      Bike     Paris   red
2       Car  New York  blue
Is there a way to do that? I want to write something like:
df1[df1.Color.isin(df2.Color)]
But it does not seem to work.
edit: I think the issue is that the data type in the first dataframe is str and not in the second.
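If the mismatch really is just the dtype, casting both columns to a common type before the check should make the isin approach work. A minimal sketch, assuming the values are otherwise identical:
df1[df1['Color'].astype(str).isin(df2['Color'].astype(str))]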

If you want to keep only the matching rows, you can merge and then filter out the rows that found no match. Since df2 has only the Color column, a plain left merge adds no column to test for nulls, so use the indicator flag:
df1 = df1.merge(df2.drop_duplicates(), on='Color', how='left', indicator=True)
df1 = df1[df1['_merge'] == 'both'].drop(columns='_merge')
Let me know if this works for you

You can try a merge after dropping duplicates, to avoid duplicating rows on repeated keys:
df1.merge(df2.drop_duplicates(), on='Color')
Outputting:
Transport City Color
0 Car Paris red
1 Bike Paris red
2 Car New York blue
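For contrast, without drop_duplicates the repeated 'blue' in df2 would match the New York row once per occurrence (derived from the sample frames above):
df1.merge(df2, on='Color')
  Transport      City Color
0       Car     Paris   red
1      Bike     Paris   red
2       Car  New York  blue
3       Car  New York  blue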

Related

Pandas: Check values between columns in different dataframes and return a list of possible matching values in a new column

I am trying to compare two columns from two different dataframes and return all possible matches using Python (kind of an XLOOKUP in Excel, but with multiple possible matches).
Please see the details below for sample dataframes and the work I attempted.
An explanation of the datasets below: Mark does not own any cars; however, there are several listed under his name, none of which actually belong to him. I am attempting to look at dataframe 1 (Marks), compare it against the larger dataset that has all other owners and their cars, dataframe 2 (claimed), and return possible owners for Mark's cars, as shown below.
Dataframe 1: Marks
Marks = pd.DataFrame({'Car Brand': ['Jeep','Jeep','BMW','Volvo'],'Owner Name': ['Mark',
'Mark', 'Mark', 'Mark']})
Car Brand Owner Name
0 Jeep Mark
1 Jeep Mark
2 BMW Mark
3 Volvo Mark
Dataframe 2: claimed
claimed = pd.DataFrame({'Car Brand': ['Dodge', 'Jeep', 'BMW', 'Merc', 'Volvo', 'Jeep',
'Volvo'], 'Owner Name': ['Chris', 'Frank','Rob','Kelly','John','Chris','Kelly']})
Car Brand Owner Name
0 Dodge Chris
1 Jeep Frank
2 BMW Rob
3 Merc Kelly
4 Volvo John
5 Jeep Chris
6 Volvo Kelly
The data does have duplicate car brand names; however, the owner names are unique, meaning that Kelly, even though she is mentioned twice, is the same person, and the same goes for Chris, etc.
I want Mark's dataframe to have a new column that looks like this:
Car Brand Owner Name Possible Owners
0 Jeep Mark [Frank, Chris]
1 Jeep Mark [Frank, Chris]
2 BMW Mark Rob
3 Volvo Mark [John, Kelly]
I have tried the code below:
possible_owners = list()
for cars in Marks['Car Brand']:
    for car_brands in claimed['Car Brand']:
        if Marks.loc[Marks['Car Brand'].isin(claimed['Car Brand'])]:
            sub = list()
            sub.append()
            possible_owners.append(sub)
        else:
            not_found = 'No possible Owners Identified'
            possible_owners.append(not_found)
#Then I will add possible_owners as a new column to Marks
Error: ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I have also tried a merge and Excel's XLOOKUP (which has many limitations), and I am stuck trying to understand how to return possible matches, even when there are multiple, and line them up in one row.
Question: how can I compare the two frames, return possible values from the Owner Name column and put these values in a new column in Marks' table?
Excuse my code, I am fairly new to Python.
You could pre-process the claimed dataframe then merge:
lookup = claimed.groupby('Car Brand').apply(lambda x: x['Owner Name'].to_list()).to_frame()
df_m = Marks.merge(lookup, on='Car Brand', how='left').rename(columns={0:'Possible Owners'})
print(df_m)
Result
Car Brand Owner Name Possible Owners
0 Jeep Mark [Frank, Chris]
1 Jeep Mark [Frank, Chris]
2 BMW Mark [Rob]
3 Volvo Mark [John, Kelly]
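An equivalent, slightly more direct variant is to aggregate straight to lists and name the column up front; a sketch, assuming a reasonably recent pandas:
lookup = (claimed.groupby('Car Brand')['Owner Name']
          .agg(list)
          .rename('Possible Owners')
          .reset_index())
df_m = Marks.merge(lookup, on='Car Brand', how='left')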
You can always use a list comprehension with Series.isin to do the work.
result = [claimed[claimed['Car Brand'].isin([i])]['Owner Name'].to_numpy() for i in Marks['Car Brand']]
Marks['Possible Owners'] = result
Car Brand Owner Name Possible Owners
0 Jeep Mark [Frank, Chris]
1 Jeep Mark [Frank, Chris]
2 BMW Mark [Rob]
3 Volvo Mark [John, Kelly]

Performing a VLOOKUP-type operation in pandas for two datasets

I have two dataframes, both of which contain matching columns A and B. However, the number of occurrences is not the same in each dataframe. Dataframe two contains a third column, which I would like to bring into my first dataframe, matching the values where columns A and B are the same.
The dataframes are not the same shape or size; DF #1 has a ton more columns and rows than displayed, so I can't just lift the column over.
Example:
DF #1
Ethnicity Region
Asian West
Asian West
Asian North
Black West
Black West
Black West
Mixed South
Mixed West
Mixed West
Mixed South East
DF #2
Ethnicity Region Population
Asian South East 278372
Asian East 32992
Asian South 33503
Asian East 86736
Asian East 58871
Asian North 66270
Black East 117442
Black East 69925
Black West 33614
Black West 13903
So essentially, I would like to do a VLOOKUP-type operation and create a new column in the first dataframe that tells me the population from the second dataframe.
So far I have done a groupby function which successfully sums the total number of residents per region in the second dataframe, but I am not sure how to move this to the first dataframe.
The reason behind this task is dataframe #1 contains a ton of other information which would benefit from the population figures from the second dataframe.
Any pointers/relevant documentation would be very helpful. Thanks.
You can just do a merge:
df2 = df2.groupby(['Ethnicity', 'Region']).sum().reset_index()
df1 = df1.merge(df2, how='left')
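With no on argument, merge joins on every column the two frames share; spelling out the keys and the aggregated column makes the intent explicit. A sketch, assuming Population is the only column you want summed:
df2 = df2.groupby(['Ethnicity', 'Region'], as_index=False)['Population'].sum()
df1 = df1.merge(df2, on=['Ethnicity', 'Region'], how='left')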

Populate a pandas dataframe with groupby calculations made in a pandas series

I have created a dataframe from a dictionary as follows:
my_dict = {'VehicleType':['Truck','Car','Truck','Car','Car'],'Colour':['Green','Green','Black','Yellow','Green'],'Year':[2002,2014,1975,1987,1987],'Frequency': [0,0,0,0,0]}
df = pd.DataFrame(my_dict)
So my dataframe df currently looks like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 0
1 Car Green 2014 0
2 Truck Black 1975 0
3 Car Yellow 1987 0
4 Car Green 1987 0
I'd like it to look like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
i.e., the Frequency column should represent the totals of VehicleType AND Colour combinations (but leaving out the Year column). So in row 4 for example, the 2 in the Frequency column tells you that there are a total of 2 rows with the combination of 'Car' and 'Green'.
This is essentially a 'Count' with 'Group By' calculation, and Pandas provides a way to do the calculation as follows:
grp_by_series = df.groupby(['VehicleType', 'Colour']).size()
grp_by_series
VehicleType  Colour
Car          Green     2
             Yellow    1
Truck        Black     1
             Green     1
dtype: int64
What I'd like to do next is to extract the calculated group_by values from the Panda series and put them into the Frequency column of the Pandas dataframe. I've tried various approaches but without success.
The example I've given is hugely simplified - the dataframes I'm using are derived from genomic data and have hundreds of millions of rows, and will have several frequency columns based on various combinations of other columns, so ideally I need a solution which is fast and scales well.
Thanks for any help!
You are on a good path. You can continue like this:
grp_by_series = grp_by_series.reset_index()
res = df[['VehicleType', 'Colour']].merge(grp_by_series, how='left')
df['Frequency'] = res[0]
print(df)
Output:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
I think a .transform() does what you want:
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('count')
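One subtlety worth noting: 'count' skips NaN values in Year, while 'size' counts rows unconditionally, so the latter is the safer choice if the counted column can contain missing values:
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('size')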

Plot against dummy variables and grouped values

These are some values from the table I have:
country colour ...
1 Spain red
2 USA blue
3 Greece green
4 Italy white
5 USA red
6 USA blue
7 Spain red
I want to group the countries together and plot them with the country on the x axis and the total number of each colour counted per country. For example, the USA has 2 blues and 1 red, Spain has 2 reds, etc. I want this in bar chart form, using either matplotlib or seaborn.
I assume I would have to use dummy variables for the 'colour' column, but I'm not sure how to plot against a grouped column and dummy variables.
Much appreciated if you could show and explain the process. Thank you.
Try with crosstab:
pd.crosstab(df['country'], df['colour']).plot.bar()
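To see the numbers being plotted, the intermediate crosstab (computed from the sample rows above) has one row per country and one column per colour:
pd.crosstab(df['country'], df['colour'])
colour   blue  green  red  white
country
Greece      0      1    0      0
Italy       0      0    0      1
Spain       0      0    2      0
USA         2      0    1      0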
Output: a bar chart with one cluster of bars per country and one bar per colour within each cluster.

How can I iterate through all rows of a dataframe to apply a lookup function to a string value and write the result to a new column?

I have a dataframe with several columns of personal data for each row (person). I want to apply a function that looks up each person's city or state in regional lists and then writes the result to a new column "Region" in the same dataframe.
I have been able to make the same operation work with a very simplified dataframe with categories for colors and vehicles (see below). But when I try to do it with the personal data, it won't work the same way and I don't understand why.
I've read through many threads on lambda functions, but I think what I'm asking is too complex for that. Most solutions deal with numerical data, and I'm using strings; but as I said, I was able to make it work with one dataset. Obviously I'm new here. I'd also appreciate advice on how to build the new column as part of the function instead of building it in a separate step, but that isn't frustrating me as much as the main question.
This example works:
# Python: import pandas
import pandas as pd
# Simple dataframe. Empty column 'type'.
df = pd.DataFrame({'one': ['1','2','3','4','5','6','7','8'],
                   'two': ['A','B','C','D','E','F','G','H'],
                   'three': ['car','bus','red','blue','truck','pencil','yellow','green'],
                   'type': ''})
df displays:
one two three type
0 1 A car
1 2 B bus
2 3 C red
3 4 D blue
4 5 E truck
5 6 F pencil
6 7 G yellow
7 8 H green
Now define the lists and a custom function:
# Define lists of colors and vehicles
colors = ['red','blue','green','yellow']
vehicles = ['car','truck','bus','motorcycle']
# Create function 'celltype' to return values based on x
def celltype(x):
    if x in colors: return 'color'
    elif x in vehicles: return 'vehicle'
    else: return 'other'
Then construct a loop to iterate through each row and apply the function:
# Write loop to iterate through df rows and apply function 'celltype' to column 'three' in each row
for index, row in df.iterrows():
    row['type'] = celltype(row['three'])
And in this case the result is just what I want:
one two three type
0 1 A car vehicle
1 2 B bus vehicle
2 3 C red color
3 4 D blue color
4 5 E truck vehicle
5 6 F pencil other
6 7 G yellow color
7 8 H green color
This example doesn't work, and I don't know why:
df1 = pd.DataFrame({'Last Name': ['SMITH','JONES','WILSON','DOYLE','ANDERSON'],
                    'First Name': ['TOM','DICK','HARRY','MICHAEL','KEVIN'],
                    'Code': [12,34,56,78,90],
                    'Deparment': ['Research','Management','Maintenance','Marketing','IT'],
                    'City': ['NEW YORK','BOSTON','SAN FRANCISCO','DALLAS','DETROIT'],
                    'State': ['NY','MA','CA','TX','MI'],
                    'Region': ''})
df1 displays:
Last Name First Name Code Deparment City State Region
0 SMITH TOM 12 Research NEW YORK NY
1 JONES DICK 34 Management BOSTON MA
2 WILSON HARRY 56 Maintenance SAN FRANCISCO CA
3 DOYLE MICHAEL 78 Marketing DALLAS TX
4 ANDERSON KEVIN 90 IT DETROIT MI
Again, defining lists and functions:
# Define lists for regions
east = ['NEW YORK','BOSTON']
west = ['SAN FRANCISCO','LOS ANGELES']
south = ['TX']
# Create function 'region' to return values based on x
def region(x):
    if x in east: return 'east'
    elif x in west: return 'west'
    elif x in south: return 'south'
    else: return 'other'
# Write loop to iterate through df1 rows and apply function 'region' to column 'City' in each row
for index, row in df1.iterrows():
    row['Region'] = region(row['City'])
    if row['Region'] == 'other': row['Region'] = region(row['State'])
This results in an unchanged df1; the 'Region' column is still blank. We should see 'east', 'east', 'west', 'south', 'other'. The only difference in the code is the additional if statement, to catch Dallas by state (which is something I need for my real-world dataset), but I think that line is sound, and I get the same result without it.
First off, apply and iterrows are slow, so try not to use them, ever.
What I usually do in this situation is to create a pair of forward and backward dicts:
forward = {'east': east,
           'west': west,
           'south': south}
backward = {x: k for k, v in forward.items() for x in v}
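For the lists defined in the question, backward works out to a flat city-or-state-to-region mapping:
{'NEW YORK': 'east', 'BOSTON': 'east',
 'SAN FRANCISCO': 'west', 'LOS ANGELES': 'west',
 'TX': 'south'}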
And then update with map. Since you want to update based on two columns, fillna will be helpful:
df1['Region'] = (df1['State'].map(backward)
                 .fillna(df1['City'].map(backward))
                 .fillna('other')
                 )
gives:
Last Name First Name Code Deparment City State Region
0 SMITH TOM 12 Research NEW YORK NY east
1 JONES DICK 34 Management BOSTON MA east
2 WILSON HARRY 56 Maintenance SAN FRANCISCO CA west
3 DOYLE MICHAEL 78 Marketing DALLAS TX south
4 ANDERSON KEVIN 90 IT DETROIT MI other
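Note that this maps State first and falls back to City, while the original loop checked City first and fell back to State. The precedence happens not to matter for this data; to mirror the loop's order exactly, map City first:
df1['Region'] = (df1['City'].map(backward)
                 .fillna(df1['State'].map(backward))
                 .fillna('other')
                 )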
Your issue is with using iterrows. In general, you should never modify something you are iterating over. Here, iterrows hands you a copy of each row, so assigning to row does not actually modify your df1. Whether you get a copy or a view can depend on the circumstances, which is exactly why this pattern is one you generally want to avoid.
You can make sure it modifies the original by calling the Dataframe directly with at:
for index, row in df1.iterrows():
    df1.at[index, 'Region'] = region(row['City'])
    if df1.at[index, 'Region'] == 'other':
        df1.at[index, 'Region'] = region(row['State'])
