Merging/concatenating two datasets on a specific column (different lengths) [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have two different datasets
df1
Name Surname Age Address
Julian Ross 34 Main Street
Mary Jane 52 Cook Road
len(df1) == 1200
df2
Name Country Telephone
Julian US NA
len(df2) == 800
df1 contains the full list of unique names; df2 has fewer rows because many names were never added to it.
I would like to get a final dataset with the full list of names in df1 (and all the fields that are there) plus the fields in df2. I would then expect a final dataset of length 1200, with some empty fields for the names missing from df2.
I have tried as follows:
pd.concat([df1.set_index('Name'),df2.set_index('Name')], axis=1, join='inner')
but it returns the length of the smaller dataset (i.e. 800).
I have also tried
df1.merge(df2, how = 'inner', on = ['Name'])
... same result.
I am not totally familiar with joining/merging/concatenating functions, even after reading the document https://pandas.pydata.org/docs/user_guide/merging.html .
I know this question is probably a duplicate of some others, and I will be happy to delete it if necessary, but I would be really grateful if you could provide some help and explain how to get the expected result:
df
Name Surname Age Address Country Telephone
Julian Ross 34 Main Street US NA
Mary Jane 52 Cook Road

IIUC, use pd.merge like below:
>>> df1.merge(df2, how='left', on='Name')
Name Surname Age Address Country Telephone
0 Julian Ross 34 Main Street US NaN
1 Mary Jane 52 Cook Road NaN NaN
If you want to keep the number of rows of df1, you have to use how='left'. This holds as long as there are no duplicate names in df2; duplicates there would add extra rows to the result.
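If duplicates in df2 are a possibility, a small guard (a sketch reusing the column names above) keeps the row count intact:

if not df2['Name'].is_unique:
    # keep one row per name so the left merge stays one-to-one
    df2 = df2.drop_duplicates(subset='Name')
out = df1.merge(df2, how='left', on='Name')
assert len(out) == len(df1)  # 1200 rows, one per name in df1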
Read Pandas Merging 101

Pandas dataframe combine rows representing the same entity [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 8 months ago.
I have data that contains several rows for each employee. Each row contains one attribute and its value. For example:
Worker ID  Last Name  First Name  Metric Name  Metric Value
1          Hanson     Scott       Attendance   98
1          Hanson     Scott       On time      35
2          Avery      Kara        Attendance   95
2          Avery      Kara        On time      57
I would like to combine rows based on worker id, taking metrics to their own columns like so:
Worker ID  Last Name  First Name  Attendance  On time
1          Hanson     Scott       98          35
2          Avery      Kara        95          57
I can do worker_data.pivot_table(values='Metric Value', index='Worker ID', columns=['Metric Name']), but that does not give me the first and last names as columns. What is the best Pandas way to merge these rows?
In your solution, change the index parameter to a list, and to avoid a MultiIndex remove the [] from the columns parameter:
df = (worker_data.pivot_table(index=['Worker ID','Last Name','First Name'],
columns='Metric Name',
values='Metric Value')
.reset_index()
.rename_axis(None, axis=1))
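To sanity-check, here is a minimal run with the sample rows above (worker_data rebuilt by hand):

import pandas as pd

worker_data = pd.DataFrame({
    'Worker ID': [1, 1, 2, 2],
    'Last Name': ['Hanson', 'Hanson', 'Avery', 'Avery'],
    'First Name': ['Scott', 'Scott', 'Kara', 'Kara'],
    'Metric Name': ['Attendance', 'On time', 'Attendance', 'On time'],
    'Metric Value': [98, 35, 95, 57],
})

df = (worker_data.pivot_table(index=['Worker ID', 'Last Name', 'First Name'],
                              columns='Metric Name',
                              values='Metric Value')
                 .reset_index()
                 .rename_axis(None, axis=1))
print(df)
#    Worker ID Last Name First Name  Attendance  On time
# 0          1    Hanson      Scott          98       35
# 1          2     Avery       Kara          95       57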

Pandas Dataframe: Using same category codes on different existing dataframes with same category

I have two pandas dataframes with some columns in common. These columns are of type category but unfortunately the category codes don't match for the two dataframes. For example I have:
>>> df1
artist song
0 The Killers Mr Brightside
1 David Guetta Memories
2 Estelle Come Over
3 The Killers Human
>>> df2
artist date
0 The Killers 2010
1 David Guetta 2012
2 Estelle 2005
3 The Killers 2006
But:
>>> df1['artist'].cat.codes
0 55
1 78
2 93
3 55
Whereas:
>>> df2['artist'].cat.codes
0 99
1 12
2 23
3 99
What I would like is for my second dataframe df2 to take the same category codes as the first one df1 without changing the category values. Is there any way to do this?
(Edit)
Here is a screenshot of my two dataframes. Essentially I want song_tags to have the same category codes for artist_name and track_name as the songs dataframe. song_tags is created from a merge between songs and another tag dataframe (which contains song data and their tags, without the user information) and then saved and loaded through pickle. It might also be relevant to add that I had to cast artist_name and track_name in song_tags from type object to type category.
I think essentially my question is: how do I modify the category codes of an existing dataframe column?
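One way to achieve this (a sketch, assuming every artist in df2 also appears in df1) is to cast df2's column to a CategoricalDtype built from df1's categories:

import pandas as pd

# Build a dtype from df1's categories; values in df2 that are missing
# from these categories would become NaN.
shared = pd.CategoricalDtype(categories=df1['artist'].cat.categories)
df2['artist'] = df2['artist'].astype(shared)

# Equal values now map to equal codes in both frames.
df2['artist'].cat.codes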

Merge two data frames and retain unique columns

I have these two data frames:
1st df
#df1 -----
location Ethnic Origins Percent(1)
0 Beaches-East York English 18.9
1 Davenport Portuguese 22.7
2 Eglinton-Lawrence Polish 12.0
2nd df
#df2 -----
location lat lng
0 Beaches—East York, Old Toronto, Toronto, Golde... 43.681470 -79.306021
1 Davenport, Old Toronto, Toronto, Golden Horses... 43.671561 -79.448293
2 Eglinton—Lawrence, North York, Toronto, Golden... 43.719265 -79.429765
Expected Output:
I want to use the location column of #df1 as it is cleaner and retain all other columns. I don't need the city, country info on the location column.
location Ethnic Origins Percent(1) lat lng
0 Beaches-East York English 18.9 43.681470 -79.306021
1 Davenport Portuguese 22.7 43.671561 -79.448293
2 Eglinton-Lawrence Polish 12.0 43.719265 -79.429765
I have tried several ways to merge them but to no avail.
This returns a NaN for all lat and long rows
df3 = pd.merge(df1, df2, on="location", how="left")
This returns a NaN for all Ethnic and Percent rows
df3 = pd.merge(df1, df2, on="location", how="right")
As others have noted, the problem is that the 'location' columns do not share any values. One solution is to use a regular expression to remove everything from the first comma to the end of the string:
df2.location = df2.location.replace(r',.*', '', regex=True)
Using the exact data you provide, this still won't work, because the two data frames use different kinds of dashes. You can solve this in a similar way (no regex needed this time):
df2.location = df2.location.replace('—', '-')
And then merge as you suggested
df3 = pd.merge(df1, df2, on="location", how="left")
We can also use findall to create the key:
df2['location']=df2.location.str.findall('|'.join(df1.location)).str[0]
df3 = pd.merge(df1, df2, on="location", how="left")
I'm guessing the problem is that the column you're trying to merge on doesn't hold the same values in both frames, i.e. pandas can't find corresponding values in df2.location to match df1. Try changing those first and it should work:
df2["location"] = df2["location"].apply(lambda x: x.split(",")[0])
df3 = pd.merge(df1, df2, on="location", how="left")
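Putting the cleanup and the merge together, a sketch with the sample rows above:

import pandas as pd

# Keep only the text before the first comma, then normalize the em dash
# used in df2 to the plain hyphen used in df1.
df2['location'] = (df2['location']
                   .str.split(',').str[0]
                   .str.replace('—', '-', regex=False))

df3 = pd.merge(df1, df2, on='location', how='left')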

Comparing columns from two data frames

I am relatively new to Python. Suppose I have the following two dataframes, let's say df1 and df2 respectively:
df1
Id  Name  Job
1   Jim   Tester
2   Bob   Developer
3   Sam   Support

df2
Name  Salary  Location
Jim   100     Japan
Bob   200     US
Si    300     UK
Sue   400     France
I want to compare the 'Name' column in df2 to df1 such that if a person's name in df2 does not exist in df1, then that row of df2 is output to another dataframe. So for the example above the output would be:
Name Salary Location
Si 300 UK
Sue 400 France
Si and Sue are output because they do not exist in the 'Name' column of df1.
You can use Boolean indexing:
res = df2[~df2['Name'].isin(df1['Name'].unique())]
We use hashing via pd.Series.unique as an optimization in case you have duplicate names in df1.
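For example, rebuilding the frames above by hand:

import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3],
                    'Name': ['Jim', 'Bob', 'Sam'],
                    'Job': ['Tester', 'Developer', 'Support']})
df2 = pd.DataFrame({'Name': ['Jim', 'Bob', 'Si', 'Sue'],
                    'Salary': [100, 200, 300, 400],
                    'Location': ['Japan', 'US', 'UK', 'France']})

res = df2[~df2['Name'].isin(df1['Name'].unique())]
print(res)
#   Name  Salary Location
# 2   Si     300       UK
# 3  Sue     400   France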

Filling in a pandas column based on existing number of strings

I have a pandas data-frame that looks like this:
ID Hobbby Name
1 Travel Kevin
2 Photo Andrew
3 Travel Kevin
4 Cars NaN
5 Photo Andrew
6 Football NaN
.............. 1303 rows.
The number of names filled in might also be larger than 2. I would like to end up with the entire Name column filled, with the rows split equally among the existing names (or +1 for some names when the count doesn't divide evenly). I already store the total number of names in a variable; in the case above it's 2. I tried filtering and counting by each name, but I don't know how to do this when the number of names is dynamic.
Expected Dataframe:
ID Hobbby Name
1 Travel Kevin
2 Photo Andrew
3 Travel Kevin
4 Cars Kevin
5 Photo Andrew
6 Football Andrew
I tried replacing NaN with 0 in the Name column using fillna, filtering the column down to a dataframe with only the NaN fields, and then using len(df) to count the NaNs; from there I created 2 dataframes, each containing half of df. But I think this approach is completely wrong, as I do not always have 2 names. There could be 2, 3, 4, etc. (this is given by a dictionary).
Any help highly appreciated
Thanks.
It's difficult to tell, but I think you need ffill:
df['Name'] = df['Name'].ffill()
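A quick check on the sample frame (rebuilt by hand) confirms each NaN takes the most recent name above it:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'Hobbby': ['Travel', 'Photo', 'Travel',
                              'Cars', 'Photo', 'Football'],
                   'Name': ['Kevin', 'Andrew', 'Kevin',
                            None, 'Andrew', None]})
df['Name'] = df['Name'].ffill()
# Name is now: Kevin, Andrew, Kevin, Kevin, Andrew, Andrew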
