I have an Excel file with two sheets.
One contains df1:
Country City Population Planet
Germany Berlin 30500 Earth
Spain Madrid 21021 Earth
...
And the second contains df2:
Country City Population Planet
Spain Madrid 21021 Earth
...
Now I want to compare the two dataframes and check whether there are rows in df1 that also appear in df2. If so:
I want to add a new column named double to df1 and put an "X" in it for every row that is in both df1 and df2.
import io
import pandas as pd

# create string data
df1_str = '''Country,City,Population,Planet
Germany,Berlin,30500,Earth
Spain,Madrid,21021,Earth'''
df2_str = '''Country,City,Population,Planet
Spain,Madrid,21021,Earth'''

# read into a dataframe
df1 = pd.read_csv(io.StringIO(df1_str))

# read into lists for iteration
df1_list = pd.read_csv(io.StringIO(df1_str)).values.tolist()
df2_list = pd.read_csv(io.StringIO(df2_str)).values.tolist()

# join all columns to make a unique combination per row
df1_list = ["-".join(map(str, item)) for item in df1_list]
df2_list = ["-".join(map(str, item)) for item in df2_list]

# for each combination from df1, check whether it also exists in df2
common_flag = ["X" if item in df2_list else None for item in df1_list]

# add the result to dataframe df1
df1["double"] = pd.Series(common_flag)
Make sure the column order is the same in both dataframes when creating the combination lists.
output:
Country City Population Planet double
0 Germany Berlin 30500 Earth None
1 Spain Madrid 21021 Earth X
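For comparison, a more concise alternative (just a sketch, assuming both frames share the same columns and df2 has no duplicate rows) is a left merge with indicator=True, which flags rows present in both frames without building string keys:
# left-merge df2 onto df1 and use the _merge indicator to flag rows present in both
merged = df1.merge(df2, how='left', indicator=True)
df1['double'] = merged['_merge'].eq('both').map({True: 'X', False: None})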
I have a dataframe of people's addresses and names. I have a function that processes names that I want to apply. I am creating sub-selections of people with matching addresses and applying the function to those groups.
Up to this point I have been using .loc as follows:
for x in df['address'].unique():
    sub_selection = df.loc[df['address'] == x]
    sub_selection.apply(lambda x: function(x), axis=1)
Is there a more efficient way to approach this? I am looking into pandas' .groupby() functionality, but I am struggling to get it to work.
df.groupby('address').agg(lambda x: function(x['names']))
Here is some sample data:
address, name, Unique_ID
1022 Boogie Woogie Ave, John Smith, np.nan
1022 Boogie Woogie Ave, Frederick Smith, np.nan
1022 Boogie Woogie Ave, John Jacob Smith, np.nan
3030 Sesame Street, Big Bird, np.nan
3030 Sesame Street, Elmo, np.nan
3030 Sesame Street, Big Yellow Bird, np.nan
My function itself has some moving parts, but basically I check the name against a reference dictionary I create. This process goes through a few other steps, but returns a list of indexes where the name matches. I use those indexes to assign a shared unique id to matching names. In my example, Big Bird and Big Yellow Bird would match.
def function(x):
    match_list = []
    if x['name'] in __lookup_dict[0]:
        match_list.append(__lookup_dict[0][x['name']])
    # reduce all the matching lists to a single set of place ids matching all elements
    result = set(match_list[0])
    for s in match_list[1:]:
        if len(result.intersection(s)) != 0:
            result.intersection_update(s)
    # take the reduced lists and assign each place id a unique id
    # note we are working with place ids, not the sub-df's index; they don't match
    if pd.isnull(x['Unique_ID']):
        uid = str(uuid.uuid4())
        for g in result:
            df.at[df.index[df.index == g].tolist()[0], 'Unique_ID'] = uid
    else:
        pass
    return result
Try using
df.groupby('address').apply(lambda x: function(x['names']))
Edited:
Check this example. I've used a dataframe from another StackOverflow question.
import pandas as pd

df = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Lahore", "Lahore"],
    "Points": [90.1, 90.3, 94.1, 95, 89, 90.5],
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"]
})

d = {k: v for v, k in enumerate(df.City.unique())}
df['idx'] = df['City'].replace(d)
print(df)
Output:
City Points Gender idx
0 Delhi 90.1 Male 0
1 Delhi 90.3 Female 0
2 Mumbai 94.1 Female 1
3 Mumbai 95.0 Male 1
4 Lahore 89.0 Female 2
5 Lahore 90.5 Male 2
So, try using
d = {k:v for v,k in enumerate(df['address'].unique())}
df['idx'] = df['address'].replace(d)
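If the end goal is simply to give every row of an address group a shared identifier, a groupby/transform sketch (assuming a uuid-based id, as in your function) avoids the per-row loop entirely. Note that this groups purely by identical address; the name-matching logic from your function would still have to run inside each group:
import uuid
# broadcast one freshly generated UUID to every row of each address group
df['Unique_ID'] = df.groupby('address')['name'].transform(lambda s: str(uuid.uuid4()))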
I have a dataframe with three columns denoting three zones of countries a user can be subscribed to. Each of the three columns contains a list of countries (some countries appear in all three columns).
In another dataframe I have a list of users and the countries they are in.
The objective is to identify which zone, if any, the user is in and mark whether or not they are allowed to use the service in that country.
df1 contains the country the user is in and the zone the user is subscribed to, as well as other fields.
df2 contains the zones available and the list of countries for that zone, as well as other fields.
df1.head()
name alias3 status_y country
Thetis Z1 active Romania
Demis Z1 active No_country
Donis Z1 active Sweden
Rhona Z3 active Germany
Theau Z2 active Bangladesh
df2.head()
Zone 1 Zone 2 Zone 3
ALBANIA ALBANIA ALBANIA
BELGIUM BELGIUM BELGIUM
BULGARIA AUSTRIA AUSTRIA
NaN CROATIA CROATIA
NaN NaN DENMARK
I have written conditions listing which of the three zones the user is subscribed to.
I have written values that select the country the user is in and check whether that country is in the zone the user is subscribed to.
import numpy as np

conditions = [
    (df1['alias3'] == 'Z1'),
    (df1['alias3'] == 'Z2'),
    (df1['alias3'] == 'Z3')
]
values = [
    df1['country'].str.upper().isin(df2['Zone 1']),
    df1['country'].str.upper().isin(df2['Zone 2']),
    df1['country'].str.upper().isin(df2['Zone 3'])
]
df1['valid_country'] = np.select(conditions, values)
Is there a better way to do this in pandas?
One easy way would be:
def valid(sdf):
    zone = sdf.alias3.iat[0][-1]
    sdf["valid_country"] = sdf.country.str.upper().isin(df2[f"Zone {zone}"])
    return sdf

df1 = df1.groupby("alias3").apply(valid)
This groups df1 by the alias3 values and then
applies a function to each group that checks whether the upper-cased country names in the group's country column are in the respective column of df2, storing the result in a column named valid_country.
Another way would be to alter df2 a bit:
df2.columns = df2.columns.str.replace("one ", "")
df2 = (
    df2.melt(var_name="alias3", value_name="country")
       .dropna()
       .assign(valid_country=True)
)
df2.country = df2.country.str.capitalize()
Transforming the column names from 'Zone 1/2/3' to 'Z1/2/3'
melt-ing the Zone column names into a column named alias3, with the respective country names in a column named country
Dropping the NaNs
Adding a column named valid_country all True
Capitalizing the country names
And then:
df1 = df1.merge(df2, on=["alias3", "country"], how="left")
df1["valid_country"] = df1["valid_country"].fillna(False)
Left-merge-ing the result with df1 on the columns alias3 and country
Filling in the missing False in the column valid_country
Hi,
my data contains the names of persons and a list of cities they lived in. I want to group them together following these conditions:
1. first_name and last_name are identical,
2. or (if 1. doesn't hold) their last_names are the same and they have lived in at least one identical city.
The result should be a new column indicating the group id that each person belongs to.
The DataFrame df looks like this:
>>> df
last_name first_name cities
0 Dorsey Nancy [Moscow, New York]
1 Harper Max [Munich, Paris, Shanghai]
2 Mueller Max [New York, Los Angeles]
3 Dorsey Nancy [New York, Miami]
4 Harper Maxwell [Munich, Miami]
The new dataframe df_id should look like this. The order of id is irrelevant (i.e., which group gets id=1), but only observations that fulfill either condition 1 or 2 should get the same id.
>>> df_id
last_name first_name cities id
0 Dorsey Nancy [Moscow, New York] 1
1 Harper Max [Munich, Paris, Shanghai] 2
2 Mueller Max [New York, Los Angeles] 3
3 Dorsey Nancy [New York, Miami] 1
4 Harper Maxwell [Munich, Miami] 2
My current code:
df = df.reset_index(drop=True)

# explode lists to rows
df_exploded = df.explode('cities')

# define id_counter and a dictionary to match index to id
id_counter = 1
id_matched = dict()

# define id function
def match_id(row):
    global id_counter
    # check if index already matched
    if row.name not in id_matched.keys():
        # get all persons with the same names (condition 1)
        select = df_exploded[(df_exploded['first_name'] == row['first_name'])
                             & (df_exploded['last_name'] == row['last_name'])]
        # get all persons with the same last_name and city (condition 2)
        if select.empty:
            select_2 = df_exploded[(df_exploded['last_name'] == row['last_name'])
                                   & (df_exploded['cities'].isin(row['cities']))]
            # create a new id for this specific person
            if select_2.empty:
                id_matched[row.name] = id_counter
            # create a new id for the group of persons and record it in the dictionary
            else:
                select_list = select_2.index.unique().tolist()
                for i in select_list:
                    id_matched[i] = id_counter
        # create a new id for the group of persons and record it in the dictionary
        else:
            select_list = select.index.unique().tolist()
            for i in select_list:
                id_matched[i] = id_counter
        # set next id
        id_counter += 1

# run function
df = df.progress_apply(match_id, axis=1)

# convert dict to DataFrame
df_id_matched = pd.DataFrame.from_dict(id_matched, orient='index')
df_id_matched['id'] = df_id_matched.index
# merge back together with df to create df_id
Does anyone have a more efficient way to perform this task? The data set is huge and it would take several days...
Thanks in advance!
Use:
# sample data was changed to use a list of cities per row,
# e.g. 'Moscow, New York' becomes ['Moscow', 'New York']
df_id = pd.DataFrame({'last_name': ['Dorsey', 'Harper', 'Mueller', 'Dorsey'],
                      'first_name': ['Nancy', 'Max', 'Max', 'Nancy'],
                      'cities': [['Moscow', 'New York'], ['Munich', 'Paris', 'Shanghai'],
                                 ['New York', 'Los Angeles'], ['New York', 'Miami']]})
#created default index values
df_id = df_id.reset_index(drop=True)
#explode lists to rows
df = df_id.explode('cities')
#get duplicates per 3 columns, get at least one dupe by index and sorting
s = (df.duplicated(['last_name','first_name','cities'], keep=False)
.any(level=0)
.sort_values(ascending=False))
#create new column with cumulative sum by inverted mask
df_id['id'] = (~s).cumsum().add(1)
print (df_id)
last_name first_name cities id
0 Dorsey Nancy [Moscow, New York] 1
1 Harper Max [Munich, Paris, Shanghai] 2
2 Mueller Max [New York, Los Angeles] 3
3 Dorsey Nancy [New York, Miami] 1
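One caveat: Series.any(level=0) is deprecated and removed in newer pandas versions; if you are on such a version, the equivalent (a sketch) is a groupby on the index level:
# equivalent of .any(level=0) on newer pandas
s = (df.duplicated(['last_name', 'first_name', 'cities'], keep=False)
       .groupby(level=0).any()
       .sort_values(ascending=False))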
I'm trying to merge 2 dataframes on multiple columns: ['Unit', 'Geo', 'Region']. The condition is: when a value from right_df encounters an empty string in left_df, it should be considered a match.
E.g., when the first row of right_df joins with the first row of left_df, we have an empty string in the column 'Region'. So we need to consider the empty string as a match to 'AU' and get the final result 'DE'.
left_df = pd.DataFrame({'Unit': ['DEV','DEV','DEV','DEV','DEV','TEST1','TEST2','ACCTEST1','ACCTEST1','ACCTEST1'],
                        'Geo': ['AP','JAPAN','NA','Europe','Europe','','','AP','Europe','NA'],
                        'Region': ['','','','France','BENELUX','','','','',''],
                        'Resp': ['DE','FG','BO','MD','KR','PM','NJ','JI','HN','FG']})

right_df = pd.DataFrame({'Unit': ['DEV','DEV','DEV','DEV','TEST1','TEST2','ACCTEST1','DEV','ACCTEST1','TEST1','TEST2','DEV','TEST1','TEST2'],
                         'Geo': ['AP','JAPAN','AP','NA','AP','Europe','Europe','Europe','AP','JAPAN','AP','Europe','Europe','Europe'],
                         'Region': ['AU','JAPAN','ISA','USA','AU/NZ','France','CEE','France','ISA','JAPAN','ISA','BENELUX','CEE','CEE']})
I tried the code below, but it works only if the 'empty strings' have values. I'm struggling to add a condition saying 'consider an empty string as a match' or 'ignore it when right_df encounters an empty string and continue with the available match'. Would appreciate any help. Thanks!
result_df = pd.merge(left_df, right_df, how='inner', on=['Unit','Geo','Region'])
Use DataFrame.merge inside a list comprehension and perform the
left merge operations in the following order:
Merge right_df with left_df on columns Unit, Geo and Region and select column Resp.
Merge right_df with left_df(drop duplicate values in Unit and Geo) on columns Unit, Geo and select column Resp.
Merge right_df with left_df(drop duplicate values in Unit) on column Unit and select column Resp.
Then use functools.reduce with the reducing function Series.combine_first to combine all the series in the list s, and assign the result to the Resp column in right_df.
from functools import reduce
c = ['Unit', 'Geo', 'Region']
s = [right_df.merge(left_df.drop_duplicates(c[:len(c) - i]),
on=c[:len(c) - i], how='left')['Resp'] for i in range(len(c))]
right_df['Resp'] = reduce(pd.Series.combine_first, s)
Result:
print(right_df)
Unit Geo Region Resp
0 DEV AP AU DE
1 DEV JAPAN JAPAN FG
2 DEV AP ISA DE
3 DEV NA USA BO
4 TEST1 AP AU/NZ PM
5 TEST2 Europe France NJ
6 ACCTEST1 Europe CEE HN
7 DEV Europe France MD
8 ACCTEST1 AP ISA JI
9 TEST1 JAPAN JAPAN PM
10 TEST2 AP ISA NJ
11 DEV Europe BENELUX KR
12 TEST1 Europe CEE PM
13 TEST2 Europe CEE NJ
Looks like there's some mismatch in your mapping; however, you can use the update method to handle empty strings:
import numpy as np

# replace empty strings with nan
left_df = left_df.replace('', np.nan)
# replace np.nan with values from the other dataframe
left_df.update(right_df, overwrite=False)
# merge
df = pd.merge(left_df, right_df, how='right', on=['Unit','Geo','Region'])
Hope this gives you some idea.
Title URL Price Address Rental_Type
0 House URL $600 Auburn Apartment
1 House URL $600 Auburn Apartment
2 House URL $900 NY Apartment
3 Room! URL $1018 NaN Office
4 Room! URL $910 NaN Office
I'm trying to drop duplicates under Title, but I only want to drop rows that have Rental_Type == 'Office'. I also have a second constraint: I would like to drop the rows with Rental_Type == 'Apartment', but keep the first duplicate in this scenario. So in this situation rows 3 and 4 would drop, and then only row 1 out of rows 0/1.
I would build this up in steps to construct a list of the indices you wish to drop.
offices = df['Rental_Type'] == 'Office'
apts = df['Rental_Type'] == 'Apartment'
dup_offices = df[offices].duplicated('Title', keep=False)
dup_apts = df[apts].duplicated('Title', keep='first')
to_drop = pd.Index(dup_apts[dup_apts].index.tolist() + \
dup_offices[dup_offices].index.tolist())
df = df.drop(to_drop)
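Equivalently, the same drops can be expressed as a single boolean mask (a sketch; duplicates are detected within each Rental_Type by including it in the key):
# mark within-type Title duplicates, then drop per the two rules
dup_all = df.duplicated(['Rental_Type', 'Title'], keep=False)
dup_after_first = df.duplicated(['Rental_Type', 'Title'], keep='first')
drop_mask = ((df['Rental_Type'] == 'Office') & dup_all) | \
            ((df['Rental_Type'] == 'Apartment') & dup_after_first)
df = df[~drop_mask]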
You can drop the duplicates with your constraints in this fashion:
# drop all duplicates with Rental_Type == 'Office'
df1 = df[(df.Rental_Type == 'Office')].drop_duplicates(['Title'], keep=False)

# capture the duplicate rows with Rental_Type == 'Apartment'
df2 = df[(df.Rental_Type == 'Apartment')].duplicated(['Title'], keep='last')
df3 = df[(df.Rental_Type == 'Apartment')][df2.values][1:]

# put them together
df_final = pd.concat([df1, df3])
In [1]: df_final
Out[1]:
Title URL Price Address Rental_Type
1 House URL 600 Auburn Apartment