I have two DataFrames. After merging them on "Name", some rows return NaN because the names in one of them are incomplete.
df1
Name       Info 1
Walter     Adress 1
john wick  Adress 1

df2
Name          Info 2
Walter White  Male
john wick     Male
df2 = pd.merge(df1,df2,on='Name', how='left')
I'm getting
Name       Info 1    Info 2
Walter     NaN       NaN
john wick  Adress 1  Male
I want
Name          Info 1    Info 2
Walter White  Adress 1  Male
john wick     Adress 1  Male
How can I handle the rows that return NaN, for example by trying to match them on a substring instead? I don't know whether using merge in the first place was the best approach.
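Try this:
df2 = pd.merge_asof(df1, df2, on='Name')
This depends on how closely the different values resemble each other. Note that merge_asof does not take a how argument (it behaves like a left join) and is designed for sorted, ordered keys such as numbers or timestamps, so it may not work on plain string names.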
The reason it's not working is that pandas doesn't consider "Walter" and "Walter White" to be the same value.
When you perform a left join on df1, pandas keeps all the rows of df1 and adds the values from df2 that have exactly the same "Name" value. Since "Walter" is not present in df2, it puts NaN in the "Info 2" column (again, "Walter" and "Walter White" are different values).
One way you could solve this is to create a separate "First_Name" column in each DataFrame and then merge on "First_Name" instead. This only works if first names are unique within each DataFrame.
Something like:
df1["First_Name"] = df1.apply(lambda row: row['Name'].split()[0], axis = 1)
df2["First_Name"] = df2.apply(lambda row: row['Name'].split()[0], axis = 1)
Then simply use the same merge as you did...
df2 = pd.merge(df1,df2,on='First_Name', how='left')
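To make the first-name approach concrete, here is a minimal sketch built on the sample data from the question. The names merged and result, and the suffixes "_short" and "_full", are only illustrative, and the sketch assumes every first name is unique within each DataFrame.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"Name": ["Walter", "john wick"],
                    "Info 1": ["Adress 1", "Adress 1"]})
df2 = pd.DataFrame({"Name": ["Walter White", "john wick"],
                    "Info 2": ["Male", "Male"]})

# Take the first whitespace-separated token of each name as the join key.
df1["First_Name"] = df1["Name"].str.split().str[0]
df2["First_Name"] = df2["Name"].str.split().str[0]

# Both frames have a "Name" column, so suffixes tell them apart after the merge.
merged = pd.merge(df1, df2, on="First_Name", how="left",
                  suffixes=("_short", "_full"))

# Keep the fuller name from df2 and the two info columns.
result = (merged[["Name_full", "Info 1", "Info 2"]]
          .rename(columns={"Name_full": "Name"}))
print(result)
```

If two people share a first name, this join would match every combination of them, so a stricter key would be needed in that case.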
Related
I have two dataframes, df1 and df2, which have a common column heading, Name. The Name values are unique within df1 and df2. df1's Name values are a subset of those in df2; df2 has more rows -- about 17,300 -- than df1 -- about 6,900 -- but each Name value in df1 is in df2. I would like to create a list of Name values in df1 that meet certain criteria in other columns of the corresponding rows in df2.
Example:
df1:
   Name  Age  Hair
0  Jim   25   black
1  Mary  58   brown
3  Sue   15   purple

df2:
   Name   Country  phoneOS
0  Shari  GB       Android
1  Jim    US       Android
2  Alain  TZ       iOS
3  Sue    PE       iOS
4  Mary   US       Android
I would like a list of only those Name values in df1 that have df2 Country and phoneOS values of US and Android. The example result would be [Jim, Mary].
I have successfully selected rows within one dataframe that meet multiple criteria in order to copy those rows to a new dataframe. In that case pandas/Python does the iteration over rows internally. I guess I could write a "manual" iteration over the rows of df1 and access df2 on each iteration. I was hoping for a more efficient solution whereby the iteration was handled internally as in the single-dataframe case. But my searches for such a solution have been fruitless.
Try:
df1.loc[df1.Name.isin(df2.loc[df2.Country.eq('US') &
                              df2.phoneOS.eq('Android'), 'Name']), 'Name']
Result:
0 Jim
1 Mary
Name: Name, dtype: object
If you want the result as a list, just add .to_list() at the end.
data = df1.merge(df2, on='Name')
data.loc[((data.phoneOS == 'Android') & (data.Country == "US")), 'Name'].values.tolist()
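As a self-contained sketch of the isin approach, the following rebuilds the example frames from the question and produces the list directly. The intermediate names wanted and names are illustrative only.

```python
import pandas as pd

# Example frames from the question (df1 keeps its original index 0, 1, 3).
df1 = pd.DataFrame({"Name": ["Jim", "Mary", "Sue"],
                    "Age": [25, 58, 15],
                    "Hair": ["black", "brown", "purple"]},
                   index=[0, 1, 3])
df2 = pd.DataFrame({"Name": ["Shari", "Jim", "Alain", "Sue", "Mary"],
                    "Country": ["GB", "US", "TZ", "PE", "US"],
                    "phoneOS": ["Android", "Android", "iOS", "iOS", "Android"]})

# Names in df2 whose rows satisfy both conditions.
wanted = df2.loc[df2["Country"].eq("US") & df2["phoneOS"].eq("Android"), "Name"]

# Keep only the df1 names that appear in that selection, in df1's order.
names = df1.loc[df1["Name"].isin(wanted), "Name"].to_list()
print(names)
```

No explicit loop over rows is needed: both the boolean mask and the membership test are vectorized inside pandas.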
I have two datasets, df1 having columns
Date  Name  Text  Label
      John        1
      Jack        0
      Jim         1
(I only filled those fields that I need)
and df2 having columns
NickName Label
John 1
John 1
Wes 0
Jim 0
Jim 0
Jim 0
Martin 0
Name and NickName refer to the same thing; however, some values might appear in only one of the two columns. Label in df1 is not the same as Label in df2 (a poor choice of name), so I will need to rename Label in df2, for example to Index.
I would like df2 to also contain the Label column from df1 for those NickName values that are in df1, and the value -1 for those that are not in df1.
The expected output should be
NickName Label Index
John 1 1
John 1 1
Wes 0 -1
Jim 0 0
Jim 0 0
Jim 0 0
Martin 0 0
...
Please note that all Name in df1 are in df2.
For renaming the column, I have no problem (using rename in pandas) but I would need actually to understand how to merge the two datasets in order to get the three columns and corresponding values as in the expected output. I am not familiar with merging/joining, but I would say that I would need something like
df1.append(df2)
You can use pd.DataFrame.merge and add suffixes to the columns so you can see which original DataFrame they came from.
df1.merge(
    df2,
    left_on='Name',
    right_on='NickName',
    suffixes=('_left', '_right'),
    how="outer",
)
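If the goal is only to attach df1's label to each row of df2, with -1 where there is no match, a lookup with Series.map may be simpler than a merge. The sketch below follows the rule as stated in the question and assumes each Name occurs once in df1; the names lookup and out are illustrative, and the new column is called Index as in the question.

```python
import pandas as pd

# Only the columns that matter for the lookup are reproduced here.
df1 = pd.DataFrame({"Name": ["John", "Jack", "Jim"],
                    "Label": [1, 0, 1]})
df2 = pd.DataFrame({"NickName": ["John", "John", "Wes", "Jim", "Jim", "Jim", "Martin"],
                    "Label": [1, 1, 0, 0, 0, 0, 0]})

# Series indexed by Name, used as a lookup table (Name must be unique in df1).
lookup = df1.set_index("Name")["Label"]

out = df2.copy()
# Names missing from df1 map to NaN, which is then replaced by -1.
out["Index"] = out["NickName"].map(lookup).fillna(-1).astype(int)
print(out)
```

Because df2 is never joined against another frame, its row count and row order stay exactly as they were.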
I'm uncertain whether one method, or even the practice of merging dataframes, can achieve my intentions below, or whether I need to resort to writing my own functions using for loops.
I want to progressively build up a master dataframe comprising all possible column values from a number of smaller dataframes with variable column data. All the dataframes come from records with the same naming convention, and duplicate rows with the same name should be avoided.
I want to successively merge each smaller dataframe into the master
No data should be lost. Where names are shared, values should be combined into the master dataframe's existing columns
No new columns should be created
If two smaller dataframes have different values in the same column I would like those values to share the same column in the master, list or string doesn't matter
When a smaller dataframe entry of the same name contains new values for previously unfilled columns they should be merged into existing rows rather than creating new rows
1. My dataframes
df_master = pd.DataFrame(columns=('Names','Age','Hair','Breakfast','Lunch','Dinner'))
df_lunch = pd.DataFrame([['Joe',16,'red','sandwich'],['Mary',22,'brown','carrot']],columns=('Names','Age','Hair','Lunch'))
df_ingredients = pd.DataFrame([['Joe','ham']],columns=('Names','Lunch',))
df_breakfast = pd.DataFrame([['Joe','fruit loops'],['Mary','toast']],columns=('Names','Breakfast',))
2. Attempt to progressively build up the master dataframe
df_master = pd.merge(df_master, df_lunch, on=['Names','Age','Hair','Lunch'], how='outer')
so far, so good (except the column order goes funny)
df_master = pd.merge(df_master, df_ingredients, on=['Names','Lunch'], how='outer')
joe has been given a new row, his ham hasn't been added to his sandwich
df_master = pd.merge(df_master, df_breakfast, on=['Names','Breakfast'], how='outer')
joe and mary have new rows, just to accommodate breakfast
3. How it should ideally look by this stage
df_base = pd.DataFrame(columns=('Names','Age','Hair','Breakfast','Lunch','Dinner'))
df_sofar = pd.DataFrame([['Joe',16,'red','fruit loops', 'sandwich, ham'],['Mary',22,'brown','toast','carrot']],columns=('Names','Age','Hair','Breakfast','Lunch'))
df_ideal = pd.merge(df_base, df_sofar, on=['Names','Age','Hair','Breakfast','Lunch'], how='outer')
shows how I'd like the final dataframe from 2. to look
   Dinner  Names  Age  Hair   Breakfast    Lunch
0          Joe    16   red    fruit loops  sandwich, ham
1          Mary   22   brown  toast        carrot
Am I going about this all wrong? Or is there something obvious I'm missing? Thanks!
Let's try concat + groupby + agg:
df = pd.concat(
[df_master, df_lunch, df_ingredients, df_breakfast]
)
g = df.groupby('Names', sort=False, as_index=False).agg(lambda x: ','.join(x.dropna()))
g['Age'] = df_lunch['Age']
   Names  Breakfast    Dinner  Hair   Lunch         Age
0  Joe    fruit loops          red    sandwich,ham  16
1  Mary   toast                brown  carrot        22
An Alternative
If you cast everything to string, you lose no information during the groupby:
df = pd.concat(
[df_master, df_lunch, df_ingredients, df_breakfast]
)
df.groupby('Names', sort=False, as_index=False).agg(
lambda x: ','.join(x.dropna().astype(str))
)
   Names  Age   Breakfast    Dinner  Hair   Lunch
0  Joe    16.0  fruit loops          red    sandwich,ham
1  Mary   22.0  toast                brown  carrot
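The 16.0 in the output above comes from Age being upcast to float once NaN rows are concatenated in. One way around that, sketched here as a variation on the same concat + groupby + agg idea, is to cast each small frame to string before concatenating. The helper join_unique and the names frames, combined and master are hypothetical, and Age ends up as a string column in this version.

```python
import pandas as pd

# Frames from the question; the target column order replaces the empty df_master.
df_lunch = pd.DataFrame([['Joe', 16, 'red', 'sandwich'], ['Mary', 22, 'brown', 'carrot']],
                        columns=['Names', 'Age', 'Hair', 'Lunch'])
df_ingredients = pd.DataFrame([['Joe', 'ham']], columns=['Names', 'Lunch'])
df_breakfast = pd.DataFrame([['Joe', 'fruit loops'], ['Mary', 'toast']],
                            columns=['Names', 'Breakfast'])
columns = ['Names', 'Age', 'Hair', 'Breakfast', 'Lunch', 'Dinner']

def join_unique(values):
    """Join the distinct non-missing values of one column, in order of appearance."""
    seen = []
    for value in values.dropna():
        if value not in seen:
            seen.append(value)
    return ', '.join(seen)

frames = [df_lunch, df_ingredients, df_breakfast]
# Casting to str first keeps 16 as '16' instead of 16.0 after NaNs are introduced.
combined = pd.concat([frame.astype(str) for frame in frames], ignore_index=True)

master = (combined.groupby('Names', sort=False)
                  .agg(join_unique)
                  .reset_index()
                  .reindex(columns=columns))  # adds the still-empty Dinner column
print(master)
```

Deduplicating inside join_unique means that feeding the same small frame in twice does not repeat values such as "sandwich".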
I have a dataframe which is something like this
Victim Sex       Female    Male  Unknown
Perpetrator Sex
Female            10850   37618       24
Male              99354  299781       92
Unknown           33068  156545      148
I'm planning to drop both the row indexed as 'Unknown' and the column named 'Unknown'. I know how to drop a row and a column but I was wondering whether you could drop a row and a column at the same time in pandas? If yes, how could it be done?
This should do the job. It isn't strictly "at the same time", since the two drops are chained, but no intermediate object is returned to you.
df.drop("Unknown", axis=1).drop("Unknown", axis=0)
For a concrete example:
df = pd.DataFrame([[1,2],[3,4]], columns=['A', 'B'], index=['C','D'])
print(df)
   A  B
C  1  2
D  3  4
the call
df.drop('B', axis=1).drop('C', axis=0)
returns
   A
D  3
I think the closest thing to 'at the same time' is selecting with loc and Index.difference (note that difference returns the remaining labels in sorted order):
print (df.index.difference(['Unknown']))
Index(['Female', 'Male'], dtype='object')
print (df.columns.difference(['Unknown']))
Index(['Female', 'Male'], dtype='object')
df = df.loc[df.index.difference(['Unknown']), df.columns.difference(['Unknown'])]
print (df)
Victim Sex       Female    Male
Perpetrator Sex
Female            10850   37618
Male              99354  299781
You can delete columns and rows in one line using just their positions. For example, to delete the columns at positions 2, 3 and 5 and, at the same time, the rows at positions 0, 1 and 3 along with the last row of the dataframe, you can do the following:
df.drop(df.columns[[2,3,5]], axis = 1).drop(df.index[[0,1,3,-1]])
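For the table in the question, DataFrame.drop also accepts index= and columns= keywords, so both labels can go in a single call. The sketch below rebuilds that table by hand; the variable name trimmed is illustrative.

```python
import pandas as pd

# Rebuild the cross-tabulation shown in the question.
labels = ["Female", "Male", "Unknown"]
df = pd.DataFrame(
    [[10850, 37618, 24],
     [99354, 299781, 92],
     [33068, 156545, 148]],
    index=pd.Index(labels, name="Perpetrator Sex"),
    columns=pd.Index(labels, name="Victim Sex"),
)

# Drop the 'Unknown' row and the 'Unknown' column in one call.
trimmed = df.drop(index="Unknown", columns="Unknown")
print(trimmed)
```

Unlike the difference-based selection, this keeps the remaining rows and columns in their original order.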
So I have two dataframes: one where certain columns are filled in and one where others are filled in but some from the previous df are missing. Both share some common non-empty columns.
DF1:
FirstName Uid JoinDate BirthDate
Bob 1 20160628 NaN
Charlie 3 20160627 NaN
DF2:
FirstName Uid JoinDate BirthDate
Bob 1 NaN 19910524
Alice 2 NaN 19950403
Result:
FirstName Uid JoinDate BirthDate
Bob 1 20160628 19910524
Alice 2 NaN 19950403
Charlie 3 20160627 NaN
Assuming that these rows do not share index positions in their respective dataframes, is there a way that I can fill the missing values in DF1 with values from DF2 where the rows match on a certain column (in this example Uid)?
Also, is there a way to create a new entry in DF1 from DF2 if there isn't a match on that column (e.g. Uid) without removing rows in DF1 that don't match any rows in DF2?
EDIT: I updated the dataframes to add non-matching results in both dataframes that I need in the result df. I also updated my last question to reflect that.
UPDATE: you can do it setting the proper indices and finally resetting the index of joined DF:
In [14]: df1.set_index('FirstName').combine_first(df2.set_index('FirstName')).reset_index()
Out[14]:
FirstName Uid JoinDate BirthDate
0 Alice 2.0 NaN 19950403.0
1 Bob 1.0 20160628.0 19910524.0
2 Charlie 3.0 20160627.0 NaN
try this:
In [113]: df2.combine_first(df1)
Out[113]:
FirstName Uid JoinDate BirthDate
0 Bob 1 20160628.0 19910524
1 Alice 2 NaN 19950403
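Since the question asks to match on Uid, here is a minimal sketch of the same combine_first idea with Uid as the index instead of FirstName. It uses the edited sample data from the question and assumes Uid is unique within each frame; the name result is illustrative.

```python
import numpy as np
import pandas as pd

# Sample frames from the edited question.
df1 = pd.DataFrame({"FirstName": ["Bob", "Charlie"],
                    "Uid": [1, 3],
                    "JoinDate": [20160628, 20160627],
                    "BirthDate": [np.nan, np.nan]})
df2 = pd.DataFrame({"FirstName": ["Bob", "Alice"],
                    "Uid": [1, 2],
                    "JoinDate": [np.nan, np.nan],
                    "BirthDate": [19910524, 19950403]})

# Align both frames on Uid; df1's values win and df2 fills its gaps.
# Rows that exist in only one frame are kept as well.
result = (df1.set_index("Uid")
             .combine_first(df2.set_index("Uid"))
             .reset_index())
print(result)
```

The date columns come back as floats because they contain NaN. The plain df2.combine_first(df1) call aligns on the default positional index instead, which is why it only matches rows that happen to sit at the same position.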