I have a dataset with some customer information; one column contains device codes identifying the device used. I need to translate these codes into actual model names.
I also have a second table with a column holding device codes (the same codes as in the first table) and another column holding the corresponding model names.
I know it may seem trivial; I have managed to translate codes into models with a for loop, the .loc method, and conditional substitution, but I'm looking for a more structured solution.
Here's an extract of the data.
import pandas as pd

df = pd.DataFrame(
    {
        'Device_code': ['SM-A520F', 'SM-A520F', 'iPhone9,3', 'LG-H860', 'WAS-LX1A', 'WAS-LX1A']
    }
)
transcription_table = pd.DataFrame(
    {
        'Device_code': ['SM-A520F', 'SM-A520X', 'iPhone9,3', 'LG-H860', 'WAS-LX1A', 'XT1662', 'iPhone11,2'],
        'models': ['Galaxy A5(2017)', 'Galaxy A5(2017)', 'iPhone 7', 'LG G5', 'P10 lite', 'Motorola Moto M', 'iPhone XS']
    }
)
Basically, I need to obtain the explicit model of the device every time there's a match between the Device_code columns of the two tables, and overwrite the Device_code of the first table (df) with the actual model name (or write it into a newly created column on the same row; that is less of a problem).
Thank you for your help.
Turn your transcription_table into an actual mapping (i.e. a dictionary) and then use Series.map:
# works because transcription_table has exactly two columns, in (key, value) order
transcription_dict = dict(transcription_table.values)
df['models'] = df['Device_code'].map(transcription_dict)
print(df)
output:
  Device_code           models
0    SM-A520F  Galaxy A5(2017)
1    SM-A520F  Galaxy A5(2017)
2   iPhone9,3         iPhone 7
3     LG-H860            LG G5
4    WAS-LX1A         P10 lite
5    WAS-LX1A         P10 lite
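Note that codes missing from the table map to NaN. If you would rather overwrite Device_code in place and keep any code that has no match, a small variation (a sketch, assuming unmatched codes should pass through unchanged):
df['Device_code'] = df['Device_code'].map(transcription_dict).fillna(df['Device_code'])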
This is just one solution:
# Dictionary that maps device codes to models
mapping = transcription_table.set_index('Device_code').to_dict()['models']
# Apply mapping to a new column in the dataframe
# If no match is found, None will be filled in
df['Model'] = df['Device_code'].apply(lambda x: mapping.get(x))
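A merge-based equivalent, if you prefer staying in pandas entirely (a sketch using the column names from the question; unmatched codes end up with NaN in models):
df = df.merge(transcription_table, on='Device_code', how='left')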
Related
I'm working with numpy and pandas and need to "merge" rows. I have a column marital-status with values like:
'Never-married', 'Divorced', 'Separated', 'Widowed'
and:
'Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse'
I'm wondering how to merge them into just two values: the first four into single and the last three into in relationship. I need it for one-hot encoding later.
In the sample output, the marital-status should be just single or in relationship, matching what I mentioned above.
You can use pd.Series.map to convert certain values to others. For this you need a dictionary that assigns each old value to a new value. Values not present in the dictionary will be replaced with NaN:
married_map = {
    status: 'Single'
    for status in ['Never-married', 'Divorced', 'Separated', 'Widowed']}
married_map.update({
    status: 'In-relationship'
    for status in ['Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse']})
df['marital-status'] = df['marital-status'].map(married_map)
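Since the goal is one-hot encoding later, a possible follow-up on the mapped column (a sketch):
pd.get_dummies(df['marital-status'], prefix='marital-status')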
I have data something like below:
CANDIDATE_ID  Job1_Skill1
12            conflict management
13            asset management
I want to add one-hot encoded columns for each skill in the table, using Python and pandas, based on a reference skill set (a list).
For example, if the reference skill set given is
[conflict management, asset management, .net]
then my output should be something like below:
CANDIDATE_ID  Job1_Skill1          FP_conflict management  FP_asset management  FP_.net
12            conflict management  1                       0                    0
13            asset management     0                       1                    0
I could do it by comparing row by row, but that does not seem to be an efficient approach. Can anyone suggest an efficient way to do this using Python?
The get_dummies method produces output based on the values present in the column, but I need to encode against a specific reference list: here get_dummies can only produce FP_conflict management and FP_asset management, not FP_.net.
get_dummies is also dynamic for each dataframe, whereas I need to encode against the same fixed list of skills for every dataframe, so it cannot be used directly.
Here is a simple workaround: add the reference list to the source dataframe, encode, then drop the helper rows.
import pandas as pd

# set up source data
df_data = pd.DataFrame([[12, 'conflict'], [13, 'asset']], columns=['CANDIDATE_ID', 'Job1_Skill1'])
# define reference list with a reserved sentinel id
skills = [[999, 'conflict'], [999, 'asset'], [999, '.net']]
df_skills = pd.DataFrame(skills, columns=['CANDIDATE_ID', 'Job1_Skill1'])
# add reference data to the main df (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df_data_with_skills = pd.concat([df_data, df_skills], ignore_index=True)
# encode with pd.get_dummies
skills_dummies = pd.get_dummies(df_data_with_skills.Job1_Skill1)
result = pd.concat([df_data_with_skills, skills_dummies], axis=1)
# remove reference rows
result.drop(result[result['CANDIDATE_ID'] == 999].index, inplace=True)
print(result)
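A variant without sentinel rows (a sketch; it assumes the reference skills are available as a plain Python list) is to reindex the dummy columns against the reference list, which adds an all-zero column for any skill that never occurs in the data:
import pandas as pd

df_data = pd.DataFrame([[12, 'conflict'], [13, 'asset']], columns=['CANDIDATE_ID', 'Job1_Skill1'])
reference_skills = ['conflict', 'asset', '.net']

# one column per reference skill, 0-filled where a skill is absent from the data
dummies = (pd.get_dummies(df_data['Job1_Skill1'])
             .reindex(columns=reference_skills, fill_value=0)
             .add_prefix('FP_'))
result = pd.concat([df_data, dummies], axis=1)
print(result)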
I've seen a large number of similar questions but nothing quite answers what I am looking to do.
I have two dataframes
Conn_df, which contains names and company details entered manually (e.g. Conn_df["Name", "Company_name", "Company_Address"])
Cleanse_df, which contains cleaned-up company names (e.g. Cleanse_df["Original_Company_Name", "Cleanse_Company_Name"])
The data for both is held in csv files that are imported into the script.
I want to change the company names in Conn_df.Company_Name using the values in Cleanse_df: wherever Conn_df.Company_Name equals Cleanse_df.Original_Company_Name, replace it with Cleanse_df.Cleanse_Company_Name.
I have tried:
Conn_df["Company"] = Conn_df["Company"].replace(Conn_df["Company"], Cleanse_df["Cleansed"]) but got
replace() takes no keyword arguments
I also tried:
Conn_df["Company"] = Conn_df["Company"].map(Cleanse_df.set_index("Original")["Cleansed"]) but got
Reindexing only valid with uniquely valued Index objects
Any suggestions on how to get the values replaced? I would note that both dataframes run to many tens of thousands of rows, so creating a manual list is not possible.
I think you want something along the lines of this:
conn_df = pd.DataFrame({'Name': ['Mac', 'K', 'Hutt'],
                        'Company_name': ['McD', 'KFC', 'PH'],
                        'Company_adress': ['street1', 'street2', 'street4']})
cleanse_df = pd.DataFrame({'Original_Company_Name': ['McD'],
                           'Cleanse_Company_Name': ['MacDonalds']})
cleanse_df = cleanse_df.rename(columns={'Original_Company_Name': 'Company_name'})
merged_df = conn_df.merge(cleanse_df, on='Company_name', how='left')
# assign instead of inplace fillna on a column to avoid chained-assignment issues
merged_df['Cleanse_Company_Name'] = merged_df['Cleanse_Company_Name'].fillna(merged_df['Company_name'])
final_df = merged_df[['Name', 'Company_adress', 'Cleanse_Company_Name']] \
    .rename(columns={'Cleanse_Company_Name': 'Company_name'})
This would return:
   Name Company_adress Company_name
0   Mac        street1   MacDonalds
1     K        street2          KFC
2  Hutt        street4           PH
You merge the two dataframes and keep the replacement value; if there is no replacement for a name, the name just stays the same, which is what the fillna call does.
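As an aside, the map attempt from the question fails with "Reindexing only valid with uniquely valued Index objects" when Original_Company_Name contains duplicates; a sketch that deduplicates the lookup first (using the un-renamed cleanse_df, and assuming duplicated originals map to the same cleansed name):
lookup = (cleanse_df.drop_duplicates('Original_Company_Name')
                    .set_index('Original_Company_Name')['Cleanse_Company_Name'])
conn_df['Company_name'] = conn_df['Company_name'].map(lookup).fillna(conn_df['Company_name'])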
I'm using pandas, and have two data frames:
df1:
id date status rpbid rpfid
1 d1 closed null 10
2 d2 closed null 11
3 d3 closed null null
and df2:
id date status rpbid rpfid
10 d10 updated 1 null
11 d11 updated 2 9
9 d9 updated 11 null
The idea is that I would like to handle 2 cases:
1. where the closed record was the first and final record for that instance (id 3 in df1),
2. where the closed record has one or more updated records linked in df2.
rpbid and rpfid stand for replacedbyid and replacementforid.
So the resulting df would be:
id date status rpbid rpfid id2 date2 rpbid2 rpfid2
1 d1 closed null 10 10 d10 1 null
2 d2 closed null 11 9 d9 11 null
3 d3 closed null null null null null null
So far, I've tried doing a first left join on df1 and df2 to get all of the first recursive joins. I then tried using a loop to check whether rpbid2 was null; if it wasn't, I looked up that rpbid2 value in the id column of df2, and I would then like to update the second half of the merged data frame with the next step in the join, where applicable.
Here is the original code; I haven't been able to get it to run without errors:
import pandas as pd

df = pd.read_csv(filename)
df_initial = df.loc[df['LetterStatus'] == 'CLOSED']
dfx = df.loc[df['LetterStatus'] == 'UPDATED']
df_merged = pd.merge(df_initial, dfx, how='left',
                     left_on='ReferenceNumber', right_on='ReplacedByRefNumber')
df_copy = df_merged.copy()  # .copy() so edits to one frame do not affect the other
for row in range(len(df_merged)):
    if len(str(df_merged.iloc[row]['ReplacedByRefNumber_y'])) > 1:
        # look up the next update record by its reference number
        row_slice = dfx.loc[dfx['ReferenceNumber'] == df_merged.iloc[row]['ReplacedByRefNumber_y']]
        if row_slice.size == 0:
            df_merged.iloc[row]['ReplacedByRefNumber_y'] = 'Unknown'
            df_copy.iloc[row]['ReplacedByRefNumber_y'] = 'Unknown'
        else:
            df_copy.iloc[row][24:0] = row_slice
print(df_copy)
For more context: if the replacedbyid is null and the status is 'updated', that means it was the first record for that given order.
Disclaimer: instead of trying to get to the bottom of the question and work out a way to obtain the data frame you want, my intention with this answer is rather to make sure you are absolutely certain of what you are doing, and maybe help you structure your data a bit better.
From what I see, you have two data frames with reciprocal bindings to update each other's content; the problem is you are showing an example in which a row that is to update an element in the other data frame is itself to be updated as well.
You are making a mess of the data structure and row identification. I am not very clear on whether the id references in each data frame refer to a row in their own data frame or in the other; IDs are not sorted by insertion order. When you build your joint data frame, you have recursively included a column that still needs replacing again, making your data frame grow horizontally with no actual purpose.
I think you have tried to invent your own way of updating data, and now you are running into problems that you would not otherwise have had if you used a data structure that is more common and already thought through to be scalable and easy to manipulate.
If you want to update data, the best way is to use a data model with correctly formatted tables inside a database (you can still use pandas; data frames can be your tables). You can then apply an update request right where the to-be-updated content lives, instead of keeping a separate record of updates that at the same time have update requests of their own. That is very messy. If you want to keep a record of updates, have one table that is constantly being updated, and another table that keeps a record of each already-executed manipulation of the first; in the latter you can store the previous value and the value it was updated to.
You have to name your data frames properly, and when you reference an ID in another data frame, that field's name should inherently indicate which data frame it refers to.
You can include dates in a field directly; you don't have to reference another table for them, which does not look good. Just use the datetime.datetime module and dump the object into the data frame; Python takes care of the rest.
Variable names, too, should be somewhat self-explanatory: instead of using rpbid and having to explain to everyone that it means replacedbyid, just use replaced_by_id (note the underscores separating words).
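A minimal sketch of the structure described above (all table, column, and function names here are illustrative, not from the question): one live table holding the current state, plus an append-only log that records each executed update with its previous and new values.
import datetime
import pandas as pd

# live table: always holds the current state of each order
orders = pd.DataFrame({'id': [1, 2, 3],
                       'status': ['open', 'open', 'open']})
# audit table: one row per executed update, storing old and new values
audit_log = pd.DataFrame(columns=['order_id', 'updated_at', 'old_status', 'new_status'])

def update_status(order_id, new_status):
    global audit_log
    # apply the update right where the to-be-updated content lives
    old_status = orders.loc[orders['id'] == order_id, 'status'].iloc[0]
    orders.loc[orders['id'] == order_id, 'status'] = new_status
    # record the executed manipulation in the audit table
    audit_log = pd.concat([audit_log, pd.DataFrame([{
        'order_id': order_id,
        'updated_at': datetime.datetime.now(),
        'old_status': old_status,
        'new_status': new_status,
    }])], ignore_index=True)

update_status(1, 'closed')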
I have two DataFrames:
df_components: list of unique components (ID, DESCRIPTION)
dataset: several rows and columns from a CSV (one of these columns contains the description of a component).
I need to create a new column in the dataset with the ID of the component according to the df_components.
I tried to do this way:
Creating the df_components and the ID column based on the index
components = dataset["COMPDESC"].unique()
df_components = pd.DataFrame(components, columns=['DESCRIPTION'])
df_components.sort_values(by='DESCRIPTION', ascending=True, inplace=True)
df_components.reset_index(drop=True, inplace=True)
df_components.index += 1
df_components['ID'] = df_components.index
Sample output:
                              DESCRIPTION  ID
1                                AIR BAGS   1
2                        AIR BAGS:FRONTAL   2
3  AIR BAGS:FRONTAL:SENSOR/CONTROL MODULE   3
4                    AIR BAGS:SIDE/WINDOW   4
Create the COMP_ID in the dataset:
def create_component_id_column(row):
    found = df_components[df_components['DESCRIPTION'] == row['COMPDESC']]
    return found.ID if len(found.index) > 0 else None

dataset['COMP_ID'] = dataset.apply(lambda row: create_component_id_column(row), axis=1)
However, this gives me the error ValueError: Wrong number of items passed 248, placement implies 1, where 248 is the number of items in df_components.
How can I create this new column with the ID from the item found on df_components?
Your logic seems overcomplicated. Since you are currently creating df_components from dataset, a better idea would be to use Categorical Data with dataset. This means you do not need to create df_components.
Step 1
Convert dataset['COMPDESC'] to categorical.
dataset['COMPDESC'] = dataset['COMPDESC'].astype('category')
Step 2
Create ID from categorical codes. Since categories are alphabetically sorted by default and indexing starts from 0, add 1 to the codes.
dataset['ID'] = dataset['COMPDESC'].cat.codes + 1
If you wish, you can extract the entire categorical mapping to a dictionary:
cat_map = dict(enumerate(dataset['COMPDESC'].cat.categories))
Remember that there will always be a 1-offset if you want your IDs to begin at 1. In addition, you will need to update 'ID' explicitly every time 'COMPDESC' changes.
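A minimal end-to-end sketch with a few of the sample descriptions (illustrative data only):
import pandas as pd

dataset = pd.DataFrame({'COMPDESC': ['AIR BAGS:FRONTAL', 'AIR BAGS', 'AIR BAGS:FRONTAL']})
dataset['COMPDESC'] = dataset['COMPDESC'].astype('category')
dataset['ID'] = dataset['COMPDESC'].cat.codes + 1
print(dataset)
#            COMPDESC  ID
# 0  AIR BAGS:FRONTAL   2
# 1          AIR BAGS   1
# 2  AIR BAGS:FRONTAL   2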
Advantages of using categorical data
Memory efficient: strings are only stored once.
Structure: you define the categories and have an automatic layer of data validation.
Consistent: since category to code mappings are always 1-to-1, they will always be consistent, even when new categories are added.