Match values of different columns and dataframes - python

Check if value from one dataframe exists in another dataframe
df_threads = pd.DataFrame({'conv_id': ['12', '15', '14', '23'], 'tweets': ['Nice', 'Test', 'Hi', 'Yes']})
df_orig = pd.DataFrame({'id': ['17', '12', '15', '6','67'], 'tweets': ['hola', 'mundo', 'yes', 'look', 'Beautiful']})
I would like to know if the value of the column "id" of the dataframe df_orig is in the column "conv_id" of df_threads and if it is true, add a column "File" in df_threads with the value "Spanish" and if it False set Portuguese.
Expected Output
df_threads
conv_id tweet File
---------------------------------------------
12 "Nice" "Spanish"
15 "Test" "Spanish"
14 "Hi" "Spanish"
23 "Yes" "Spanish"
56 "NotHappy" "Portuguese"

Use np.where and df.where:
df_threads['File'] = np.where(df_threads['conv_id'].isin(df_orig['id']),
'Spanish', 'Portuguese')
>>> df_threads
conv_id tweets File
0 12 Nice Spanish
1 15 Test Spanish
2 14 Hi Portuguese
3 23 Yes Portuguese
Note: check your input data and your outcome, they not match.

Related

Converting to dataframe, beginner question

I have a piece of data that looks like this
my_data[:5]
returns:
[{'key': ['Aaliyah', '2', '2016'], 'values': ['10']},
{'key': ['Aaliyah', '2', '2017'], 'values': ['26']},
{'key': ['Aaliyah', '2', '2018'], 'values': ['21']},
{'key': ['Aaliyah', '2', '2019'], 'values': ['26']},
{'key': ['Aaliyah', '2', '2020'], 'values': ['15']}]
The key represents Name, Gender, and Year. The value is number.
I do not manage to generate a data frame with columns name, gender, year, and number.
Can you help me?
Here is one way, using a generator:
from itertools import chain
pd.DataFrame.from_records((dict(zip(['name', 'gender', 'year', 'number'],
chain(*e.values())))
for e in my_data))
Without itertools:
pd.DataFrame(((E:=list(e.values()))[0]+E[1] for e in my_data),
columns=['name', 'gender', 'year', 'number'])
output:
name gender year number
0 Aaliyah 2 2016 10
1 Aaliyah 2 2017 26
2 Aaliyah 2 2018 21
3 Aaliyah 2 2019 26
4 Aaliyah 2 2020 15

Several dictionary with duplicate key but different value and no limit in column

Here dataset with unlimited key in dictionary. The detail column in row may have different information products depending on customer.
ID Name Detail
1 Sara [{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}]
2 Frank [{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]
My expected output is
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24
So basically you have a nested JSON in your detail column that you need to break out into a df then merge with your original.
import pandas as pd
import json
from pandas import json_normalize
#create empty df to hold the detail information
detailDf = pd.DataFrame()
#We will need to loop over each row to read each JSON
for ind, row in df.iterrows():
#Read the json, make it a DF, then append the information to the empty DF
detailDf = detailDf.append(json_normalize(json.loads(row['Detail']), record_path = ('Order'), meta = [['Personal','ID'], ['Personal','Name'], ['Personal','Type'],['Personal','TypeName']]))
# Personally, you don't really need the rest of the code, as the columns Personal.Name
# and Personal.ID is the same information, but none the less.
# You will have to merge on name and ID
df = df.merge(detailDf, how = 'right', left_on = [df['Name'], df['ID']], right_on = [detailDf['Personal.Name'], detailDf['Personal.ID'].astype(int)])
#Clean up
df.rename(columns = {'ID_x':'ID', 'ID_y':'Personal_Order_ID'}, inplace = True)
df.drop(columns = {'Detail', 'key_1', 'key_0'}, inplace = True)
If you look through my comments, I recommend using detailDf as your final df as the merge really isnt necessary and that information is already in the Detail JSON.
First you need to create a function that processes the list of dicts in each row of Detail column. Briefly, pandas can process a list of dicts as a dataframe. So all I am doing here is processing the list of dicts in each row of Personal and Detail column, to get mapped dataframes which can be merged for each entry. This function when applied :
def processdicts(x):
personal=pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Personal']))
personal=personal.rename(columns={"ID": "Personal_ID"})
personal['Personal_Name']=personal['Name']
orders=pd.DataFrame(list(pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Order']))[0]))
orders=orders.rename(columns={"ID": "Order_ID"})
personDf=orders.merge(personal, left_index=True, right_index=True)
return personDf
Create an empty dataframe that will contain the compiled data
outcome=pd.DataFrame(columns=[],index=[])
Now process the data for each row of the DataFrame using the function we created above. Using a simple for loop here to show the process. 'apply' function can also be called for greater efficiency but with slight modification of the concat process. With an empty dataframe at hand where we will concat the data from each row, for loop is as simple as 2 lines below:
for details in yourdataframe['Detail']:
outcome=pd.concat([outcome,processdicts(details)])
Finally reset index:
outcome=outcome.reset_index(drop=True)
You may rename columns according to your requirement in the final dataframe. For example:
outcome=outcome.rename(columns={"TypeName": "Personal_TypeName","ProductName":"Personal_Order_ProductName","ProductID":"Personal_Order_ProductID","Price":"Personal_Order_Price","Date":"Personal_Order_Date","Order_ID":"Personal_Order_ID","Type":"Personal_Type"})
Order (or skip) the columns according to your requirement using:
outcome=outcome[['Name','Personal_ID','Personal_Name','Personal_Type','Personal_TypeName','Personal_Order_ID','Personal_Order_Date','Personal_Order_ProductID','Personal_Order_ProductName','Personal_Order_Price']]
Assign a name to the index of the dataframe:
outcome.index.name='ID'
This should help.
You can use explode to get all elements of lists in Details separatly, and then you can use Shubham Sharma's answer,
import io
import pandas as pd
#Creating dataframe:
s_e='''
ID Name
1 Sara
2 Frank
'''
df = pd.read_csv(io.StringIO(s_e), sep='\s\s+', engine='python')
df['Detail']=[[{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}],[{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]]
#using explode
df = df.explode('Detail').reset_index()
df['Detail']=df['Detail'].apply(lambda x: [x])
print('using explode:', df)
#retrieved from #Shubham Sharma's answer:
personal = df['Detail'].str[0].str.get('Personal').apply(pd.Series).add_prefix('Personal_')
order = df['Detail'].str[0].str.get('Order').str[0].apply(pd.Series).add_prefix('Personal_Order_')
result = pd.concat([df[['ID', "Name"]], personal, order], axis=1)
#reset ID
result['ID']=[i+1 for i in range(len(result.index))]
print(result)
Output:
#Using explode:
index ID Name Detail
0 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '01', 'TypeName': 'Book'}, 'Order': [{'ID': '0001', 'Date': '20200222', 'ProductID': 'C0123', 'ProductName': 'ABC', 'Price': '4'}]}]
1 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0004', 'Date': '20200222', 'ProductID': 'D0123', 'ProductName': 'Small beef', 'Price': '15'}]}]
2 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0005', 'Date': '20200222', 'ProductID': 'D0200', 'ProductName': 'Shrimp', 'Price': '28'}]}]
3 1 2 Frank [{'Personal': {'ID': '002', 'Name': 'Frank', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0008', 'Date': '20200228', 'ProductID': 'D0288', 'ProductName': 'Salmon', 'Price': '24'}]}]
#result:
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
0 1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
1 2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
2 3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
3 4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24

How can I replace multiple rows simultaneously in a Python dataframe?

I have a dataset with the following unique values in one of its columns.
df['Gender'].unique()
array(['Female', 'M', 'Male', 'male', 'm', 'Male-ish', 'maile',
'Trans-female', 'Cis Female', 'something kinda male?', 'Cis Male',
'queer/she/they', 'non-binary', 'Make', 'Nah', 'All', 'Enby',
'fluid', 'Genderqueer', 'Androgyne', 'Agender', 'Guy (-ish) ^_^',
'male leaning androgynous', 'Male ', 'Man', 'msle', 'Neuter',
'queer', 'A little about you', 'Malr',
'ostensibly male, unsure what that really means')]
As you can see, there are obvious cases where a row should be listed as 'Male' (I'm referring to the cases where 'Male' is misspelled, of course). How can I replace these values with 'Male' without calling the replace function ten times? This is the code I have tried:
x=0
while x<=11:
for i in df['Gender']:
if i[0:2]=='Ma':
print('Male')
elif i[0]=='m':
print('Male')
x+=1
However, I just get a print of a bunch of "Male".
Edit: I want to convert the following values to 'Male': 'M', 'male', 'm', 'maile', 'Make', 'Man', 'msle', 'Malr', 'Male '
Create a list with all the nicknames of Male:
males_list = ['M', 'male', 'm', 'maile', 'Make', 'Man', 'msle', 'Malr', 'Male ']
And then replace them with:
df.loc[df['Gender'].isin(males_list), 'Gender'] = 'Male'
btw: There is always a better solution than looping the rows in pandas, not just in cases like this.
I would use the map function as it allows you to create any custom logic. So for instance, by looking at your code, something like this would do the trick:
def correct_gender(text):
if text[0:2]=='Ma' or text[0]=='m':
return "Male"
return text
df["Gender"] = df["Gender"].map(correct_gender)
If I understand you correctly, you want a more generalized approach. We can use regex to check if the word starts with M or has the letters Ma preceded by a whitespace, so we dont catch Female:
(?i): stands for ignore case sensitivity
?<=\s: means all the words which start with ma and are preceded by a whitespace
df.loc[df['Gender'].str.contains('(?i)^M|(?<=\s)ma'), 'Gender'] = 'Male'
Output
Gender
0 Female
1 Male
2 Male
3 Male
4 Male
5 Male
6 Male
7 Trans-female
8 Cis Female
9 Male
10 Male
11 queer/she/they
12 non-binary
13 Male
14 Nah
15 All
16 Enby
17 fluid
18 Genderqueer
19 Androgyne
20 Agender
21 Guy (-ish) ^_^
22 Male
23 Male
24 Male
25 Male
26 Neuter
27 queer
28 A little about you
29 Male
30 Male

How to restructure dataframe to convert column values to row values based on condition

I have a dataframe with 5 columns and want to convert 2 of the columns (Chemo and Surgery) based on their values (greater than 0) to rows (diagnosis series) and add the information like the individual id and diagnosis at age to the rows.
Here is my dataframe
import pandas as pd
data = [['A-1', 'Birth', '0', '0', '0'], ['A-1', 'Lung cancer', '25', '25','25'],['A-1', 'Death', '50', '0','0'],['A-2', 'Birth', '0', '0','0'], ['A-2','Brain cancer', '12', '12','0'],['A-2', 'Skin cancer', '20','20','20'], ['A-2', 'Current age', '23', '0','0'],['A-3', 'Birth','0','0','0'], ['A-3', 'Brain cancer', '30', '0','30'], ['A-3', 'Lung cancer', '33', '33', '0'], ['A-3', 'Current age', '35', '0','0']]
df = pd.DataFrame(data, columns=["ID", "Diagnosis", "Age at Diagnosis", "Chemo", "Surgery"])
print df
I have tried to get the values where the Chemo/Surgery is greater than 0 but when I tried to add it as a row, it doesn't work.
This is what I want the end result to be.
ID Diagnosis Age at Diagnosis
0 A-1 Birth 0
1 A-1 Lung cancer 25
2 A-1 Chemo 25
3 A-1 Surgery 25
4 A-1 Death 50
5 A-2 Birth 0
6 A-2 Brain cancer 12
7 A-2 Chemo 12
8 A-2 Skin cancer 20
9 A-2 Chemo 20
10 A-2 Surgery 20
11 A-2 Current age 23
12 A-3 Birth 0
13 A-3 Brain cancer 30
14 A-3 Surgery 30
15 A-3 Lung cancer 33
16 A-3 Chemo 33
17 A-3 Current age 35
This is one of the things I have tried:
chem = "Chemo"
try_df = (df[chem] > 1)
nd = df[try_df]
df["Diagnosis"] = df[chem]
print df
We can melt the two columns Chemo and Surgery, then drop all the zero and concat back:
# melt the two columns
new_df = df[['ID', 'Chemo', 'Surgery']].melt(id_vars='ID',
value_name='Age at Diagnosis',
var_name='Diagnosis')
# filter out the zeros
new_df = new_df[new_df['Age at Diagnosis'].ne('0')]
# concat with the original dataframe, ignoring the extra columns
new_df = pd.concat((df,new_df), sort=False, join='inner')
# sort values
new_df.sort_values(['ID','Age at Diagnosis'])
Output:
ID Diagnosis Age at Diagnosis
0 A-1 Birth 0
1 A-1 Lung cancer 25
1 A-1 Chemo 25
12 A-1 Surgery 25
2 A-1 Death 50
3 A-2 Birth 0
4 A-2 Brain cancer 12
4 A-2 Chemo 12
5 A-2 Skin cancer 20
5 A-2 Chemo 20
16 A-2 Surgery 20
6 A-2 Current age 23
7 A-3 Birth 0
8 A-3 Brain cancer 30
19 A-3 Surgery 30
9 A-3 Lung cancer 33
9 A-3 Chemo 33
10 A-3 Current age 35
This attempt is pretty verbose and takes a few steps. WE can't do a simple pivot or index/column stacking because we need to modify one column with partial results from another. This requires splitting and appending.
Firstly, convert your dataframe into dtypes we can work with.
data = [['A-1', 'Birth', '0', '0', '0'], ['A-1', 'Lung cancer', '25', '25','25'],['A-1', 'Death', '50', '0','0'],['A-2', 'Birth', '0', '0','0'], ['A-2','Brain cancer', '12', '12','0'],['A-2', 'Skin cancer', '20','20','20'], ['A-2', 'Current age', '23', '0','0'],['A-3', 'Birth','0','0','0'], ['A-3', 'Brain cancer', '30', '0','30'], ['A-3', 'Lung cancer', '33', '33', '0'], ['A-3', 'Current age', '35', '0','0']]
df = pd.DataFrame(data, columns=["ID", "Diagnosis", "Age at Diagnosis", "Chemo", "Surgery"])
df[["Age at Diagnosis", "Chemo", "Surgery"]] = df[["Age at Diagnosis", "Chemo", "Surgery"]].astype(int)
Now we split the thing up into bits and pieces.
# I like making a copy or resetting an index so that
# pandas is not operating off a slice
df_chemo = df[df.Chemo > 0].copy()
df_surgery = df[df.Surgery > 0].copy()
# drop columns you don't need
df_chemo.drop(["Chemo", "Surgery"], axis=1, inplace=True)
df_surgery.drop(["Chemo", "Surgery"], axis=1, inplace=True)
df.drop(["Chemo", "Surgery"], axis=1, inplace=True)
# Set Chemo and Surgery Diagnosis
df_chemo.Diagnosis = "Chemo"
df_surgery.Diagnosis = "Surgery"
Then append everything together. You can do this because the column dimensions match.
df_new = df.append(df_chemo).append(df_surgery)
# make it look pretty
df_new.sort_values(["ID", "Age at Diagnosis"]).reset_index(drop=True)

Remove characters from a string in a dataframe

python beginner here. I would like to change some characters in a column in a dataframe under certain conditions.
The dataframe looks like this:
import pandas as pd
import numpy as np
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue (VS)', 'red', 'yellow (AG)', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data, index = ['0', '1', '2', '3'])
df
My goal is to replace in the column last name the space followed by the parenthesis and the two letters.
Blue instead of Blue (VS).
There is 26 letter variations that I have to remove but only one format: last_name followed by space followed by parenthesis followed by two letters followed by parenthesis.
From what I understood it should be that in regexp:
( \(..\)
I tried using str.replace but it only works for exact match and it replaces the whole value.
I also tried this:
df.loc[df['favorite_color'].str.contains(‘VS’), 'favorite_color'] = ‘random’
it also replaces the whole value.
I saw that I can only rewrite the value but I also saw that using this:
df[0].str.slice(0, -5)
I could remove the last 5 characters of a string containing my search.
In my mind I should make a list of the 26 occurrences that I want to be removed and parse through the column to remove those while keeping the text before. I searched for post similar to my problem but could not find a solution. Do you have any idea for a direction ?
You can use str.replace with pattern "(\(.*?\))"
Ex:
import pandas as pd
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue (VS)', 'red', 'yellow (AG)', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data, index = ['0', '1', '2', '3'])
df["newCol"] = df["favorite_color"].str.replace("(\(.*?\))", "").str.strip()
print( df )
Output:
age favorite_color grade name newCol
0 20 blue (VS) 88 Willard Morris blue
1 19 red 92 Al Jennings red
2 22 yellow (AG) 95 Omar Mullins yellow
3 21 green 70 Spencer McDaniel green

Categories