Match lists based on Name and DOB - python

This seems like it should be easy, but I can't seem to find what I'm looking for. I have two lists of people (FirstName, LastName, Date of Birth), and I just want to know which people are in both lists, and which ones are in one but not the other.
I've tried something like
common = pd.merge(list1, list2, how='left', left_on=['Last', 'First', 'DOB'], right_on=['Patient Last Name', 'Patient First Name', 'Date of Birth']).dropna()
based on something else I found online, but it gives me this error:
KeyError: 'Date of Birth'
I've verified that that is indeed the column heading in the second list, so I don't understand what's wrong. Has anyone done matching like this? What's the easiest/fastest way? The names may have different formatting between lists, like "Smith-Jones" vs. "SmithJones" vs. "Smith Jones", but I get around that by stripping all spaces and punctuation from the names. I assume that's a good first step?

Try this; it should work (note: on Python 3, StringIO lives in the io module, and print is a function):
from io import StringIO
import pandas as pd

TESTDATA = StringIO("""DOB;First;Last
2016-07-26;John;smith
2016-07-27;Mathew;George
2016-07-28;Aryan;Singh
2016-07-29;Ella;Gayau
""")
list1 = pd.read_csv(TESTDATA, sep=";")

TESTDATA = StringIO("""Date of Birth;Patient First Name;Patient Last Name
2016-07-26;John;smith
2016-07-27;Mathew;XXX
2016-07-28;Aryan;Singh
2016-07-20;Ella;Gayau
""")
list2 = pd.read_csv(TESTDATA, sep=";")

print(list2)
print(list1)

common = pd.merge(list1, list2, how='left', left_on=['Last', 'First', 'DOB'], right_on=['Patient Last Name', 'Patient First Name', 'Date of Birth']).dropna()
print(common)
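A couple of notes on the original question: a KeyError on a column you can see in the file usually means the header doesn't match exactly (stray whitespace is a common culprit; `df.columns = df.columns.str.strip()` can help). Also, to see which people are in only one list, an outer merge with `indicator=True` is handy. Below is a sketch combining that with the name normalization the asker mentions; the sample data is made up, and the normalize helper is an illustration, not part of the original answer:

```python
import re
import pandas as pd

list1 = pd.DataFrame({'First': ['John', 'Ella'],
                      'Last': ['Smith-Jones', 'Gayau'],
                      'DOB': ['2016-07-26', '2016-07-29']})
list2 = pd.DataFrame({'Patient First Name': ['John', 'Aryan'],
                      'Patient Last Name': ['Smith Jones', 'Singh'],
                      'Date of Birth': ['2016-07-26', '2016-07-28']})

def normalize(name):
    """Lowercase and strip spaces/punctuation: 'Smith-Jones' -> 'smithjones'."""
    return re.sub(r'[^a-z0-9]', '', name.lower())

for col in ('First', 'Last'):
    list1[col] = list1[col].map(normalize)
for col in ('Patient First Name', 'Patient Last Name'):
    list2[col] = list2[col].map(normalize)

# outer merge keeps rows from both sides; indicator=True adds a '_merge'
# column saying whether each row came from 'left_only', 'right_only', or 'both'
merged = pd.merge(list1, list2, how='outer', indicator=True,
                  left_on=['Last', 'First', 'DOB'],
                  right_on=['Patient Last Name', 'Patient First Name', 'Date of Birth'])
in_both = merged[merged['_merge'] == 'both']
only_list1 = merged[merged['_merge'] == 'left_only']
only_list2 = merged[merged['_merge'] == 'right_only']
```

With this sample, "Smith-Jones" and "Smith Jones" normalize to the same key and match; Ella appears only in list1 and Aryan only in list2.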

Related

How to merge rows into one row when other rows have the same name in a certain column

Original and expected result: (screenshots in the original post)
Task:
I am trying to merge the 'urls' column into one row when the same name exists in the other column ('Full Path'), using Python and a Jupyter notebook.
I have tried using groupby, but it doesn't give me the result I want.
Code:
df.groupby("Full Path").apply(lambda x: ", ".join(x)).reset_index()
This is not what I am expecting.
The reason it is not working is that you need to normalize the 'Full Path' column before passing it to groupby, since the full paths differ.
Based on the sample here, the following should work:
df['Full Path'] = df['Full Path'].str.split('/').str[0:2].str.join('/')
test = df.groupby(by=['Full Path']).agg({'url': ', Next'.join})
test['url'] = test['url'].str.replace("Next","\n")
This code of course assumes that the grouping you want for the full path is determined by the first two path segments. The \n will disappear when you write the df out to Excel.
NOTE: Unless the Type and Date fields all have the same value, you cannot include them in the groupby: if you did groupby(['Full Path', 'Type', 'Date']), not all the links would be aggregated for an individual path+folder combination. If you want them included as comma/newline-separated columns like url, add them to the agg statement and apply the replace to them as well.
Code used for testing:
import pandas as pd
pd.options.display.max_colwidth = 999
data_dict = {
    'Full Path': [
        'downloads/Residences Singapore',
        'downloads/Residences Singapore/15234523524352',
        'downloads/Residences Singapore/41242341324',
    ],
    'Type': [
        'Folder',
        'File',
        'File',
    ],
    'Date': [
        '07-05-22 19:24',
        '07-05-22 19:24',
        '07-05-22 19:24',
    ],
    'url': [
        'https://www.google.com/drive/storage/345243534534522345',
        'https://www.google.com/drive/storage/523405923405672340567834589065',
        'https://www.google.com/drive/storage/90658360945869012141234',
    ],
}
df = pd.DataFrame(data_dict)
df['Full Path'] = df['Full Path'].str.split('/').str[0:2].str.join('/')
test = df.groupby(by=['Full Path']).agg({'url': ', Next'.join})
test['url'] = test['url'].str.replace("Next","\n")
test
Just group by the Full Path field with url as the value, and aggregate with a comma separator.
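A minimal sketch of that suggestion, assuming the column names 'Full Path' and 'url' from the test data above (the sample rows here are shortened placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    'Full Path': ['downloads/Residences Singapore',
                  'downloads/Residences Singapore',
                  'downloads/Other'],
    'url': ['u1', 'u2', 'u3'],
})

# one row per Full Path, urls joined with a comma separator
out = df.groupby('Full Path', as_index=False).agg({'url': ', '.join})
```

This is equivalent to the agg approach in the longer answer, minus the normalization step and the newline trick.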

Pandas - merge two lists

I've been searching everywhere for a tip; however, I can't seem to find an answer.
I am trying to show items which have the same type.
i.e. here's my dataset
What I want to end up with is a list of "Names" which are both a book and a movie.
i.e. the output should be "Harry Potter" and "LoTR".
i.e. a list like below with the "Name" column only which would show the two items:
I was thinking of doing a pivot, but not sure where to go from there.
You can use a crosstab of name against type:
ct = pd.crosstab(df["Name"], df["Type"]).astype(bool)
result = ct.index[ct["Book"] & ct["Movie"]].to_list()
Please try this:
df_new = df[['Name', 'Type']].drop_duplicates()['Name'].value_counts()
names = list(df_new[df_new > 1].index)
The above counts how many distinct types each name has and keeps names with more than one type. If you want exactly the names with two types, change the second line to this:
names = list(df_new[df_new == 2].index)
You can use intersection of set:
>>> list(set(df.loc[df['Type'] == 'Movie', 'Name']) \
.intersection(df.loc[df['Type'] == 'Book', 'Name']))
['Harry Potter', 'LoTR']
Or
>>> df.loc[df['Type'] == 'Movie', 'Name'] \
.loc[lambda x: x.isin(df.loc[df['Type'] == 'Book', 'Name'])].tolist()
['Harry Potter', 'LoTR']
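The original dataset was shown as an image, so here is a self-contained sketch of the set-intersection idea, assuming a small frame with 'Name' and 'Type' columns (the rows are invented to match the expected output):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry Potter', 'Harry Potter', 'LoTR', 'LoTR', 'Dracula'],
    'Type': ['Book', 'Movie', 'Book', 'Movie', 'Book'],
})

# names that appear with Type == 'Book' and names with Type == 'Movie'
books = set(df.loc[df['Type'] == 'Book', 'Name'])
movies = set(df.loc[df['Type'] == 'Movie', 'Name'])

# names present in both sets
both = sorted(books & movies)  # ['Harry Potter', 'LoTR']
```

Dracula is dropped because it only appears as a book.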

Convert a dataframe column into a list of object

I am using pandas to read a CSV which contains a phone_number field (string); however, I need to convert this field into the JSON format below,
[{'phone_number': '+01 373643222'}], and put it under a new column named phone_numbers. How can I do that?
I searched online, but the examples I found convert all the columns into JSON using to_json(), which apparently cannot solve my case.
Below is an example
import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'],
                   'phone_number': ['+1 569-483-2388', '+1 555-555-1212', '+1 432-867-5309']})
Use the map function, like this:
df["phone_numbers"] = df["phone_number"].map(lambda x: [{"phone_number": x}] )
display(df)
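As a self-contained sketch of what that map call produces (one row shown): each cell of the new column is a Python list holding one dict. If an actual JSON string is needed rather than Python objects, json.dumps can serialize each cell; that extra step is an assumption on my part, not part of the original answer:

```python
import json
import pandas as pd

df = pd.DataFrame({'user': ['Bob'],
                   'phone_number': ['+1 569-483-2388']})

# wrap each number in the [{'phone_number': ...}] shape
df['phone_numbers'] = df['phone_number'].map(lambda x: [{'phone_number': x}])

# optional: serialize to a real JSON string per cell
df['phone_numbers_json'] = df['phone_numbers'].map(json.dumps)
```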

How do I split a names column in a pandas data frame if only some of the names have middle names?

I am working with a pandas data frame of names, and there are a few different formats of names. Some are 'first' 'last', others are 'first' 'middle' 'last', and others are 'first initial' 'second initial' 'last'. I would like to split these into three columns using the strings. I am currently trying to use the split function, but I am getting "ValueError: Columns must be same length as key" because some names split into two columns while others split into three. How can I get around this?
df = {'name': ['bradley efron', 'c arden pope', 'a l smith']}
mak_df[['First', 'Middle', 'Last']] = mak_df.Author_Name.str.split(" ", expand = True)
Here is a workaround:
import pandas as pd
list_of_names = ['bradley efron', 'c arden pope', 'a l smith']
new_list = []
for name in list_of_names:
    new_list.append(name.split(" "))
print(new_list)
for name in new_list:
    if len(name) == 2:
        name.insert(1, " ")
print(new_list)
df = pd.DataFrame.from_records(new_list).T
df.index = ["first name", "middle name", "last name"]
df = df.T
print(df)
Output:
There's probably a better way to go about this, but here's what I've got:
import numpy as np
import pandas as pd

df = {'name': ['bradley efron', 'c arden pope', 'a l smith']}
df = pd.DataFrame(df)
df = df['name'].str.split(' ', expand=True)
df.columns = ['first', 'middle', 'last']
df['last'] = np.where(df['last'].isnull(), df['middle'], df['last'])
df['middle'] = np.where((df['middle'] == df['last']), '', df['middle'])
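An alternative sketch (not from either answer above) that also handles names with more than one middle word: take the first token as the first name, the last token as the last name, and join whatever sits in between. The column names are assumptions:

```python
import pandas as pd

df = pd.DataFrame({'name': ['bradley efron', 'c arden pope', 'a l smith']})

parts = df['name'].str.split()          # Series of token lists
df['first'] = parts.str[0]              # first token
df['last'] = parts.str[-1]              # last token
df['middle'] = parts.str[1:-1].str.join(' ')  # everything in between ('' if none)
```

Two-word names get an empty middle column instead of raising the length mismatch error.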

Finding and replacing values in specific columns in a CSV file using dictionaries

My goal here is to clean up address data from individual CSV files, using dictionaries for each individual column; sort of like automating the find-and-replace feature from Excel. The addresses are divided into columns: house numbers, street names, directions, and street type, each in its own column. I used the following code to process the whole document.
missad = {
    'Typo goes here': 'Corrected typo goes here'}

def replace_all(text, dic):
    for i, j in dic.items():  # use the dict passed in, not the global missad
        text = text.replace(i, j)
    return text

with open('original.csv', 'r') as csvfile:
    text = csvfile.read()
text = replace_all(text, missad)
with open('cleanfile.csv', 'w') as cleancsv:
    cleancsv.write(text)
While the code works, I need separate dictionaries, as some columns need specific typo fixes: for example, housenum for the house-number column, stdir for the street direction, and so on, each with its column-specific typos:
housenum = {
    'One': '1',
    'Two': '2'
}
stdir = {
    'NULL': ''}
I have no idea how to proceed; I feel it's something simple, or that I would need pandas, but I'm unsure how to continue. Would appreciate any help! Also, is there any way to group several typos together with one corrected typo? I tried the following but got an unhashable-type error:
missad = {
    ['Typo goes here', 'Typo 2 goes here', 'Typo 3 goes here']: 'Corrected typo goes here'}
Is something like this what you are looking for?
import pandas as pd
df = pd.read_csv(filename, index_col=False)  # using pandas to read in the CSV file
# let's say you want to do corrections on the 'columnforcorrection' column
correctiondict = {
    'one': 1,
    'two': 2
}
df['columnforcorrection'] = df['columnforcorrection'].replace(correctiondict)
and use this idea for other columns of interest.
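Two follow-ups the answer leaves implicit, sketched below with invented sample data: DataFrame.replace also accepts a nested dict keyed by column name, so all the per-column dictionaries can be applied in one call; and several typos can share one correction by building the dict from a list (a list cannot itself be a dict key, which is where the unhashable-type error came from):

```python
import pandas as pd

# hypothetical sample mirroring the question's columns
df = pd.DataFrame({
    'housenum': ['One', 'Two', '3'],
    'stdir': ['N', 'NULL', 'S'],
})

housenum = {'One': '1', 'Two': '2'}
stdir = {'NULL': ''}

# nested dict: {column_name: {typo: correction, ...}} applies each
# dictionary only within its own column
df = df.replace({'housenum': housenum, 'stdir': stdir})

# many typos, one correction: expand a list into dict entries
typos = ['Typo goes here', 'Typo 2 goes here', 'Typo 3 goes here']
missad = {t: 'Corrected typo goes here' for t in typos}
```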
