How to handle different spellings of column names when extracting data? - Python

For this example I have 2 dataframes. The genre column in df1 is column 3, but in df2 it is column 2, and its header is spelled slightly differently. In my actual script I have to search the column names, because the column location varies in each sheet it reads.
How do I recognise different header names as the same thing?
df1 = pd.DataFrame({'TITLE': ['The Matrix', 'Die Hard', 'Kill Bill'],
                    'VENDOR ID': ['1234', '4321', '4132'],
                    'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story', 'Shrek', 'Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                    'VENDOR ID': ['5678', '8765', '8576']})

column_names = ['TITLE', 'VENDOR ID', 'GENRE(S)']
appended_data = []
sheet1 = df1[df1.columns.intersection(column_names)]
appended_data.append(sheet1)
sheet2 = df2[df2.columns.intersection(column_names)]
appended_data.append(sheet2)
appended_data = pd.concat(appended_data, sort=False)
output:
        TITLE VENDOR ID   GENRE(S)
0  The Matrix      1234     Action
1    Die Hard      4321  Adventure
2   Kill Bill      4132      Drama
0   Toy Story      5678        NaN
1       Shrek      8765        NaN
2      Frozen      8576        NaN
desired output:
        TITLE VENDOR ID   GENRE(S)
0  The Matrix      1234     Action
1    Die Hard      4321  Adventure
2   Kill Bill      4132      Drama
0   Toy Story      5678  Animation
1       Shrek      8765  Adventure
2      Frozen      8576     Family

Thank you for taking the time to do that. Asking a good question is very important, and now that you have posed a coherent one I was able to find a simple solution rather quickly:
import pandas as pd
df1 = pd.DataFrame({'TITLE': ['The Matrix', 'Die Hard', 'Kill Bill'],
                    'VENDOR ID': ['1234', '4321', '4132'],
                    'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story', 'Shrek', 'Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                    'VENDOR ID': ['5678', '8765', '8576']})
Simple way:
We will use .append() below, but for this to work the columns in df1 and df2 need to match. In this case we'll simply rename df2's 'Genre' column to 'GENRE(S)':
df2.columns = ['TITLE', 'GENRE(S)', 'VENDOR ID']
df3 = df1.append(df2)  # note: DataFrame.append was removed in pandas 2.0; pd.concat([df1, df2]) does the same job
print(df3)
    GENRE(S)       TITLE VENDOR ID
0     Action  The Matrix      1234
1  Adventure    Die Hard      4321
2      Drama   Kill Bill      4132
0  Animation   Toy Story      5678
1  Adventure       Shrek      8765
2     Family      Frozen      8576
More elaborate:
Now, for a single use case this works, but there may be cases where you have many mismatched columns and/or have to do this repeatedly. Here is a solution that uses boolean indexing to find the mismatched names, then zip() and .rename() to map the column names:
# RELOAD YOUR ORIGINAL DF'S
df1_find = df1.columns[~df1.columns.isin(df2.columns)]  # column names that aren't in df2
df2_find = df2.columns[~df2.columns.isin(df1.columns)]  # column names that aren't in df1
zipped = dict(zip(df2_find, df1_find))                  # df2_find as keys, df1_find as values
df2.rename(columns=zipped, inplace=True)                # rename df2's columns using the mapping
df3 = df1.append(df2)
print(df3)
    GENRE(S)       TITLE VENDOR ID
0     Action  The Matrix      1234
1  Adventure    Die Hard      4321
2      Drama   Kill Bill      4132
0  Animation   Toy Story      5678
1  Adventure       Shrek      8765
2     Family      Frozen      8576
Keep in mind:
- this way of doing it assumes that both your df's have the same number of columns
- it ALSO assumes that df1 has your ideal column names, which you will use against other dfs to fix their column names
If those assumptions don't hold for your sheets, see the sketch below for one way around them.
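If the sheets genuinely spell the headers differently (as in the original question, 'Genre' vs 'GENRE(S)'), one way around both caveats is a small alias map that you maintain yourself and apply to every sheet before selecting columns. This is only a minimal sketch; the aliases dictionary and the normalise helper are illustrative, not part of the answer above:
import pandas as pd

# Hypothetical alias map: keys are spellings you expect to meet in the sheets,
# values are the canonical names you want in the combined output.
aliases = {'Genre': 'GENRE(S)', 'GENRES': 'GENRE(S)'}
column_names = ['TITLE', 'VENDOR ID', 'GENRE(S)']

def normalise(sheet):
    # Rename any known aliases (columns not in the map are left untouched),
    # then keep only the canonical columns that actually exist in this sheet.
    sheet = sheet.rename(columns=aliases)
    return sheet[sheet.columns.intersection(column_names)]

combined = pd.concat([normalise(df1), normalise(df2)], sort=False)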
I hope this helps.

Related

Search for multiple encounters across rows in pandas

I'm trying to take a dataframe of patient data and create a new df that includes their name and date if they had an encounter with three services on the same date.
First, I have a dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['Bob', 'Charlie', 'Bob', 'Sam', 'Bob', 'Sam', 'Chris'],
                   'date': ['06-02-2023', '01-02-2023', '06-02-2023', '20-12-2022', '06-02-2023', '08-06-2015', '26-08-2020'],
                   'department': ['urology', 'urology', 'oncology', 'primary care', 'radiation', 'primary care', 'oncology']})
I tried groupby on the name and date with an agg function to create a list:
df_group = df.groupby(['name', 'date']).agg({'department': pd.Series.unique})
For Bob, this made department contain [urology, oncology, radiation].
Now when I try to search for the departments in the list, to then find just the rows that contain the departments in question, I get an error.
df_group.loc[df_group['department'].str.contains('primary care')]
for instance, results in KeyError: '[nan nan nan nan nan] not in index'.
I assume there is a much easier way, but ultimately I want to just get a dataframe of people with the date when they have an encounter for urology, oncology, and radiation. With the above df it would result in:
Name Date
Bob 06-02-2023
Easy solution
# define a set of departments to check for
s = {'urology', 'oncology', 'radiation'}
# groupby and aggregate to identify the combination
# of name and date that has all the required departments
out = df.groupby(['name', 'date'], as_index=False)['department'].agg(s.issubset)
Result
# out
name date department
0 Bob 06-02-2023 True
1 Charlie 01-02-2023 False
2 Chris 26-08-2020 False
3 Sam 08-06-2015 False
4 Sam 20-12-2022 False
# out[out['department'] == True]
name date department
0 Bob 06-02-2023 True
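If you only need the name/date pairs from the desired output, one small follow-up (a sketch, not part of the answer above) is to filter on the boolean column and keep just the identifying columns:
# Keep only the groups where all required departments were seen,
# then select just name and date.
result = out.loc[out['department'], ['name', 'date']]
print(result)
#   name        date
# 0  Bob  06-02-2023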

How to join tables of multiple events while preserving information?

So I have a use case where I have a few tables with different types of events in a time series, plus another table with base information. The events are of different types with different columns; for example, a "marriage" event could have the columns "husband name" and "wife name", and a table of "job" events can have columns like "hired on" and "fired on", but can also have "husband name". The base info table is not time series data and has things like "case ID" and "city of case".
The goal would be to: 1. have all the different time series tables in one table with all possible columns (wherever there's no data in a column it's okay to have NaN), and 2. have every entry in the time series carry all available data from the base data table.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['Dave', 1, 'call'], ['Josh', 2, 'rejection'], ['Greg', 3, 'call']]),
                  columns=['husband name', 'casenum', 'event'])
df2 = pd.DataFrame(np.array([['Dave', 'Mona', 1, 'new lamp'], ['Max', 'Lisa', 1, 'big increase'],
                             ['Pete', 'Esther', 3, 'call'], ['Josh', 'Moana', 2, 'delivery']]),
                   columns=['husband name', 'wife name', 'casenum', 'event'])
df3 = pd.DataFrame(np.array([[1, 'new york'], [3, 'old york'], [2, 'york']]),
                   columns=['casenum', 'city'])
I'm trying a concat:
concat = pd.concat([df, df2, df3])
This doesn't work, because we already know that for casenum 1 the city is 'new york'.
I'm trying a join:
innerjoin = pd.merge(df, df2, on='casenum', how='inner')
innerjoin = pd.merge(innerjoin, df3, on='casenum', how='inner')
This also isn't right, as I want to keep a record of all the events from both tables. Also, interestingly, the result is the same for both inner and outer joins on the dummy data; however, on my actual data an inner join results in more rows than the sum of both event tables, which I don't quite understand.
Basically, my desired outcome would be:
husband name casenum event wife name city
0 Dave 1 call NaN new york
1 Josh 2 rejection NaN york
2 Greg 3 call NaN old york
0 Dave 1 new lamp Mona new york
1 Max 1 big increase Lisa new york
2 Pete 3 call Esther old york
3 Josh 2 delivery Moana york
I've tried inner joins, outer joins, concats, none seem to work. Maybe I'm just too tired, but what do I need to do to get this output? Thank you!
I think you can merge twice, using the outer option for the first merge:
(df.merge(df2, on=['husband name', 'casenum', 'event'], how='outer')
   .merge(df3, on='casenum')
)
Output:
husband name casenum event wife name city
0 Dave 1 call NaN new york
1 Dave 1 new lamp Mona new york
2 Max 1 big increase Lisa new york
3 Josh 2 rejection NaN york
4 Josh 2 delivery Moana york
5 Greg 3 call NaN old york
6 Pete 3 call Esther old york
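On the point about the real data producing more rows than both event tables combined: a merge multiplies rows whenever the join key is duplicated on both sides (each of the m matching left rows pairs with each of the n matching right rows), so overlapping casenum values between the event tables can inflate the result. A hedged sketch of how you might catch that with pandas' validate argument; the 'many_to_one' expectation is an assumption about how df3 relates to the event tables:
# The first merge combines the two event tables on all shared columns
# (how='outer' keeps events that appear in only one of them); the second
# attaches the base info. validate='many_to_one' raises MergeError if
# casenum is not unique in df3, one common cause of surprising row counts.
events = df.merge(df2, on=['husband name', 'casenum', 'event'], how='outer')
full = events.merge(df3, on='casenum', validate='many_to_one')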

How to find all records with the same ID between two DataFrames?

I have two DataFrames with movie review data from two different platforms (id, title, review, etc). All rows about a particular movie need to be removed from one DataFrame if that movie has not been reviewed in the other DataFrame. Here's an example:
import pandas as pd
data1 = [[1, 'Great movie!', 'Spiderman'], [1, 'Not my preference', 'Spiderman'], [2, 'Just average...', 'Captain America'], [4, 'Tolerable', 'Avengers']]
data2 = [[1, 'Did not think much of this', 'Spiderman'], [2, 'Great in my opinion!', 'Captain America'], [3, 'Could not finish', 'Batman Returns']]
df1 = pd.DataFrame(data1, columns = ['id', 'review', 'movie title'])
df2 = pd.DataFrame(data2, columns = ['id', 'review', 'movie title'])
df1.insert(3, "isValid", pd.Series(df1.id.isin(df2.id).values.astype(bool)))
df1 = df1[df1.isValid != False]
I'm wondering if there's a more efficient way to do this?
Thanks in advance for any help!
If you want to keep the 'isValid' information in df1, you can do this:
df1["isValid"] = df1.id.isin(df2.id)
new_df = df1.loc[df1.isValid == True]
id review movie title isValid
0 1 Great movie! Spiderman True
1 1 Not my preference Spiderman True
2 2 Just average... Captain America True
But if you don't care about 'isValid' and just used it for selection, you can simply do this:
new_df = df1.loc[df1.id.isin(df2.id)]
id review movie title
0 1 Great movie! Spiderman
1 1 Not my preference Spiderman
2 2 Just average... Captain America
You are looking for the merge function. This will drop all the rows whose id and movie title are not seen in both df1 and df2.
df1.merge(df2,on=["id","movie title"])
Out:
id review_x movie title review_y
0 1 Great movie! Spiderman Did not think much of this
1 1 Not my preference Spiderman Did not think much of this
2 2 Just average... Captain America Great in my opinion!
In the merged output, review_x is the review column from df1 and review_y is the review column from df2; id and movie title are the shared merge keys.
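If the goal is only to filter df1 (rather than combine the review text from both platforms), a variation on the merge idea (a sketch, not from either answer above) is to merge against just the key column, de-duplicated so df1's rows are not multiplied:
# An inner merge on the id column alone keeps the df1 rows whose id also
# appears in df2, without pulling in df2's review columns. drop_duplicates
# guards against row multiplication when an id occurs more than once in df2.
filtered = df1.merge(df2[['id']].drop_duplicates(), on='id')
print(filtered)
#    id             review      movie title
# 0   1       Great movie!        Spiderman
# 1   1  Not my preference        Spiderman
# 2   2    Just average...  Captain America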

Pandas not merging on different columns - Key error or NaN

I am trying to mimic my issue with my current data. I am trying to use pandas to merge two dataframes on different column names (Code and Number), and bring over only one column from df2 (Location). I get either a KeyError or NaN.
Both were imported from CSV files as data frames;
both column names have no whitespace;
both columns have the same dtype.
I have tried looking at other answers here, literally copying and pasting the coded answers and filling in my parts, and I still get errors or NaN.
df1:
[['Name', 'Income', 'Favourite superhero', 'Code', 'Colour'],
 ['Joe', '80000', 'Batman', '10004', 'Red'],
 ['Christine', '50000', 'Superman', '10005', 'Brown'],
 ['Joey', '90000', 'Aquaman', '10002', 'Blue']]
df2:
[['Number', 'Language', 'Location'],
 ['10005', 'English', 'Sudbury'],
 ['10002', 'French', 'Ottawa'],
 ['10004', 'German', 'New York']]
what I tried:
data = pd.merge(CSV1,
                CSV2[['Location']],
                left_on='Code',
                right_on='Number',
                how='left')
data = pd.merge(CSV1,
                CSV2[['Location']],
                left_on='Code',
                right_index=True,
                how='left')
I am trying to have df1 with the Location column from df2 for each instance where Number and Code are the same.
For both of your commands to work, you need Number to exist in the right-side dataframe. For the 1st command you need to drop the Number column after the merge. For the 2nd command, you need to set_index on the sliced right-side dataframe, and then there is no need to drop Number. I modified your commands accordingly:
CSV1.merge(CSV2[['Number', 'Location']], left_on='Code', right_on='Number', how='left').drop(columns='Number')
Or
CSV1.merge(CSV2[['Number', 'Location']].set_index('Number'), left_on='Code', right_index=True, how='left')
Out[892]:
Name Income Favourite superhero Code Colour Location
0 Joe 80000 Batman 10004 Red New York
1 Christine 50000 Superman 10005 Brown Sudbury
2 Joey 90000 Aquaman 10002 Blue Ottawa
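An equivalent route (just an alternative sketch of the same idea, not a different result) is to rename the key column on the sliced right-hand frame so both sides share 'Code', then merge on it directly:
# Rename 'Number' to 'Code' in the slice so the key names line up; a plain
# left merge on 'Code' then brings over Location without leaving an extra
# key column to drop afterwards.
data = CSV1.merge(
    CSV2[['Number', 'Location']].rename(columns={'Number': 'Code'}),
    on='Code',
    how='left',
)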

Pandas: Drop duplicates, with a constraint in another column

   Title  URL  Price Address Rental_Type
0  House  URL   $600  Auburn   Apartment
1  House  URL   $600  Auburn   Apartment
2  House  URL   $900      NY   Apartment
3  Room!  URL  $1018     NaN      Office
4  Room!  URL   $910     NaN      Office
I'm trying to drop duplicates under Title, but I only want to drop rows that have Rental_Type == 'Office'. I also have a second constraint: I would like to drop the duplicate rows with Rental_Type == 'Apartment' as well, but keep the first duplicate in that case. So in this situation rows 3 and 4 would drop, and then only row 1 out of rows 0/1 would remain.
I would build this up in steps to construct a list of the indices you wish to drop.
offices = df['Rental_Type'] == 'Office'
apts = df['Rental_Type'] == 'Apartment'
dup_offices = df[offices].duplicated('Title', keep=False)
dup_apts = df[apts].duplicated('Title', keep='first')
to_drop = pd.Index(dup_apts[dup_apts].index.tolist() +
                   dup_offices[dup_offices].index.tolist())
df = df.drop(to_drop)
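As a quick sanity check on the sample frame (my reading of the code above, not extra output from the answer): dup_offices marks both 'Room!' rows and dup_apts marks the second and third 'House' rows, so to_drop comes out as [1, 2, 3, 4] and only index 0 survives. Rows 0 and 1 are identical apart from their index, so keeping index 0 rather than index 1 is the same row in content:
print(df)
#    Title  URL Price Address Rental_Type
# 0  House  URL  $600  Auburn   Apartment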
You can drop the duplicates with your constraints in this fashion:
# drop all duplicates with Rental_Type == 'Office'
df1 = df[df.Rental_Type == 'Office'].drop_duplicates(['Title'], keep=False)
# capture the duplicate rows with Rental_Type == 'Apartment'
df2 = df[df.Rental_Type == 'Apartment'].duplicated(['Title'], keep='last')
df3 = df[df.Rental_Type == 'Apartment'][df2.values][1:]
# put them together
df_final = pd.concat([df1, df3])
In [1]: df_final
Out[1]:
Title URL Price Address Rental_Type
1 House URL 600 Auburn Apartment
