Python: One-hot encoding using a reference list

I have data something like below:
CANDIDATE_ID  Job1_Skill1
12            conflict management
13            asset management
I want to add one-hot-encoded columns for each skill in the table, using Python and pandas, based on a reference skill set (a list).
For example, if the reference skill set is
['conflict management', 'asset management', '.net']
then my output should be something like this:
CANDIDATE_ID  Job1_Skill1          FP_conflict management  FP_asset management  FP_.net
12            conflict management  1                       0                    0
13            asset management     0                       1                    0
I could do it by comparing row by row, but that does not seem efficient. Can anyone suggest an efficient way to do this in Python?
The get_dummies method produces columns based only on the values present in the column itself, but I need to encode against a specific reference list: get_dummies would give me FP_conflict management and FP_asset management, but not FP_.net.
Its output would also change from dataframe to dataframe, whereas I need to encode every dataframe against the same fixed list of skills, so get_dummies on its own cannot be used.

Here is a simple workaround: append the reference list to the source dataframe, encode, then drop the reference rows again.
import pandas as pd

# set up source data
df_data = pd.DataFrame([[12, 'conflict'], [13, 'asset']], columns=['CANDIDATE_ID', 'Job1_Skill1'])
# define the reference list with a sentinel id that no real candidate uses
skills = [[999, 'conflict'], [999, 'asset'], [999, '.net']]
df_skills = pd.DataFrame(skills, columns=['CANDIDATE_ID', 'Job1_Skill1'])
# add the reference data to the main df
# (note: DataFrame.append was removed in pandas 2.0; pd.concat is the replacement there)
df_data_with_skills = df_data.append(df_skills, ignore_index=True)
# encode with pd.get_dummies
skills_dummies = pd.get_dummies(df_data_with_skills.Job1_Skill1)
result = pd.concat([df_data_with_skills, skills_dummies], axis=1)
# remove the reference rows again
result.drop(result[result['CANDIDATE_ID'] == 999].index, inplace=True)
print(result)
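A simpler alternative (just a sketch, assuming the same column names and that the reference skills live in a plain Python list) is to declare the skill column as a pandas Categorical whose categories are the reference list; pd.get_dummies then emits one column per category, including skills that never appear in the data, so no sentinel rows are needed:
import pandas as pd

reference_skills = ['conflict', 'asset', '.net']  # the reference skill set

df_data = pd.DataFrame([[12, 'conflict'], [13, 'asset']],
                       columns=['CANDIDATE_ID', 'Job1_Skill1'])

# A Categorical fixes the full set of possible skills, so get_dummies
# produces a column for every reference skill, even the unused '.net'.
df_data['Job1_Skill1'] = pd.Categorical(df_data['Job1_Skill1'],
                                        categories=reference_skills)
dummies = pd.get_dummies(df_data['Job1_Skill1'], prefix='FP')
result = pd.concat([df_data, dummies], axis=1)
print(result)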

Map columns' values to another column

I have a dataset with some customer information; one column contains device codes identifying the device used. I need to translate these codes into actual model names.
I also have a second table with a column holding device codes (the same codes as in the first table) and another column holding the corresponding model names.
It may seem trivial; I managed to translate the codes into models with a for loop, .loc and conditional substitution, but I'm looking for a more structured solution.
Here's an extract of the data.
import pandas as pd

df = pd.DataFrame(
    {
        'Device_code': ['SM-A520F', 'SM-A520F', 'iPhone9,3', 'LG-H860', 'WAS-LX1A', 'WAS-LX1A']
    }
)
transcription_table = pd.DataFrame(
    {
        'Device_code': ['SM-A520F', 'SM-A520X', 'iPhone9,3', 'LG-H860', 'WAS-LX1A', 'XT1662', 'iPhone11,2'],
        'models': ['Galaxy A5(2017)', 'Galaxy A5(2017)', 'iPhone 7', 'LG G5', 'P10 lite', 'Motorola Moto M', 'iPhone XS']
    }
)
Basically I need to obtain the explicit model of the device every time there's a match between the device_code column of the two tables, and overwrite the device_code of the first table (df) with the actual model name (or, it can be written on the same row into a newly created column, this is less of a problem).
Thank you for your help.
Turn your transcription_table into an actual mapping (aka a dictionary) and then use Series.map:
transcription_dict = dict(transcription_table.values)
df['models'] = df['Device_code'].map(transcription_dict)
print(df)
output:
  Device_code           models
0    SM-A520F  Galaxy A5(2017)
1    SM-A520F  Galaxy A5(2017)
2   iPhone9,3         iPhone 7
3     LG-H860            LG G5
4    WAS-LX1A         P10 lite
5    WAS-LX1A         P10 lite
This is just one solution:
# Dictionary that maps device codes to models
mapping = transcription_table.set_index('Device_code').to_dict()['models']
# Apply mapping to a new column in the dataframe
# If no match is found, None will be filled in
df['Model'] = df['Device_code'].apply(lambda x: mapping.get(x))
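For completeness, a merge-based sketch (assuming the column names used above): a left join keeps every row of df and yields NaN for codes that have no entry in transcription_table.
# Left-join df onto the lookup table; unmatched device codes get NaN models.
df_merged = df.merge(transcription_table, on='Device_code', how='left')
print(df_merged)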

Return String Similarity Scores between two String Columns - Pandas

I'm trying to build search-based results, where I have an input dataframe with one row that I want to compare against another dataframe with almost 1 million rows. I'm using a package called Record Linkage.
However, I'm not able to handle typos. Let's say I have "HSBC" in my original data and the user types "HKSBC"; I want to return only the "HSBC" results. Comparing string similarity with Jaro-Winkler, I get the following results:
from pyjarowinkler import distance
distance.get_jaro_distance("hksbc", "hsbc", winkler=True, scaling=0.1)
>> 0.94
However, I'm not able to return "HSBC" as an output, so I want to create a new column in my pandas dataframe in which I compute the string similarity scores and keep only the rows whose score is above a particular threshold.
The main bottleneck is that I have almost 1 million rows, so the computation needs to be really fast.
P.S. I have no intention of using fuzzywuzzy; I'd prefer either Jaccard or Jaro-Winkler.
P.P.S. Any other ideas for handling typos in a search are also welcome.
I was able to solve it through record linkage only. Basically it does an initial indexing and generates candidate links (refer to the documentation on "Sorted Neighbourhood indexing" for more info), i.e. it builds a multi-index between the two dataframes that need to be compared, which I did manually.
So here is my code:
import recordlinkage

# Build the candidate links manually: the input df always has the static
# index 1 (it has only one row), so pair it with every row of the big df.
df['index'] = 1                        # static, since there is only one input value
df['index_2'] = range(1, len(df)+1)
df.set_index(['index', 'index_2'], inplace=True)
candidate_links = df.index
# once the candidate links have been generated, reset the index so the big df
# can be compared with the input dataframe, which keeps only the static index 1
df.reset_index(drop=True, inplace=True)
df.index = range(1, len(df)+1)

compare_cl = recordlinkage.Compare()
# 'Name' is the column name present in both dataframes
compare_cl.string('Name', 'Name', label='Name', method='jarowinkler')
# df_input is the input df with only one row (and the single index value 1)
features = compare_cl.compute(candidate_links, df_input, df)
print(features)
                   Name
index index_2
1     13446    0.494444
      13447    0.420833
      13469    0.517949
Now I can give a filter like this:
features = features[features['Name'] > 0.9] # setting the threshold which will filter away my not-so-close names.
Then,
df = df[df['index'].isin(features['index_2'])]
This filters my results and gives me the final dataframe containing only the names whose score is above the threshold set by the user.
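If the recordlinkage machinery feels heavy for a single query row, here is a simpler sketch that computes the score column directly with the same pyjarowinkler call from the question (the column name 'Name', the query string and the 0.9 threshold are assumptions):
from pyjarowinkler import distance

query = 'hksbc'  # hypothetical user input

# Score every name against the query, keep rows above the threshold,
# and show the closest matches first.
df['score'] = df['Name'].apply(
    lambda name: distance.get_jaro_distance(query, name.lower(),
                                            winkler=True, scaling=0.1))
matches = df[df['score'] > 0.9].sort_values('score', ascending=False)
print(matches)
Note that a plain apply scores every one of the ~1 million rows, which is exactly the pairwise work the sorted-neighbourhood indexing above is meant to avoid, so treat this as a fallback for smaller data.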

How do I extract variables that repeat from an Excel Column using Python?

I'm a beginner at Python and I have a school project where I need to analyze an Excel document. It has approximately 7 columns and more than 1000 rows.
There's a column named "Materials" that starts at B13. It contains codes we use to identify materials; a material code looks like 3A8356. The same codes repeat many times in the column. Is there a way to analyze the column and extract only the unique codes, so I can build a new column with one of each material code?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it to something like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials'], keep='first') (keep='first' keeps one occurrence of each code; keep=False would drop every code that appears more than once).
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
the subset argument indicates which column headings you want to look at.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Per the docs, a new data frame with the duplicates dropped is returned, so you can assign it to any variable you want. If you want to re-index the result, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
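Putting the pieces together, a minimal sketch (the file name, the column letter and the header-row position are assumptions based on the description above):
import pandas as pd

# Assumes the codes sit in column B with the header "Materials" on Excel row 12,
# so the 11 rows above it are skipped; the file name is hypothetical.
df = pd.read_excel('materials.xlsx', usecols='B', skiprows=11)

unique_materials = (df.drop_duplicates(subset=['Materials'], keep='first')
                      .sort_values('Materials')
                      .reset_index(drop=True))
print(unique_materials)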

Create a new column with the value found in another DataFrame

I have two DataFrames:
df_components: list of unique components (ID, DESCRIPTION)
dataset: several rows and columns from a CSV (one of these columns contains the description of a component).
I need to create a new column in the dataset with the ID of the component according to the df_components.
I tried to do it this way:
Creating the df_components and the ID column based on the index
components = dataset["COMPDESC"].unique()
df_components = pd.DataFrame(components, columns=['DESCRIPTION'])
df_components.sort_values(by='DESCRIPTION', ascending=True, inplace=True)
df_components.reset_index(drop=True, inplace=True)
df_components.index += 1
df_components['ID'] = df_components.index
Sample output:
                               DESCRIPTION  ID
1                                 AIR BAGS   1
2                         AIR BAGS:FRONTAL   2
3   AIR BAGS:FRONTAL:SENSOR/CONTROL MODULE   3
4                     AIR BAGS:SIDE/WINDOW   4
Create the COMP_ID in the dataset:
def create_component_id_column(row):
    found = df_components[df_components['DESCRIPTION'] == row['COMPDESC']]
    return found.ID if len(found.index) > 0 else None

dataset['COMP_ID'] = dataset.apply(lambda row: create_component_id_column(row), axis=1)
However, this gives me the error ValueError: Wrong number of items passed 248, placement implies 1, where 248 is the number of items in df_components.
How can I create this new column with the ID of the item found in df_components?
Your logic seems overcomplicated. Since you are currently creating df_components from dataset, a better idea would be to use Categorical Data with dataset. This means you do not need to create df_components.
Step 1
Convert dataset['COMPDESC'] to categorical.
dataset['COMPDESC'] = dataset['COMPDESC'].astype('category')
Step 2
Create ID from categorical codes. Since categories are alphabetically sorted by default and indexing starts from 0, add 1 to the codes.
dataset['ID'] = dataset['COMPDESC'].cat.codes + 1
If you wish, you can extract the entire categorical mapping to a dictionary:
cat_map = dict(enumerate(dataset['COMPDESC'].cat.categories))
Remember that there will always be a 1-offset if you want your IDs to begin at 1. In addition, you will need to update 'ID' explicitly every time 'DESCRIPTION' changes.
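For example, a quick sketch of shifting that mapping so its keys line up with the 1-based IDs created above:
# Keys now match dataset['ID'] (categorical code + 1).
id_to_desc = {code + 1: desc
              for code, desc in enumerate(dataset['COMPDESC'].cat.categories)}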
Advantages of using categorical data
Memory efficient: strings are only stored once.
Structure: you define the categories and have an automatic layer of data validation.
Consistent: since category to code mappings are always 1-to-1, they will always be consistent, even when new categories are added.
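If you do want to keep df_components around, a sketch of a direct fix to the original approach (using the DESCRIPTION and ID columns built in the question) is to turn it into a Series and use Series.map instead of a row-wise apply:
# Build a description -> ID lookup and map it onto the dataset.
desc_to_id = df_components.set_index('DESCRIPTION')['ID']
dataset['COMP_ID'] = dataset['COMPDESC'].map(desc_to_id)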

parsing CSV to pandas dataframes (one-to-many unmunge)

I have a csv file imported to a pandas dataframe. It probably came from a database export that combined a one-to-many parent and detail table. The format of the csv file is as follows:
header1, header2, header3, header4, header5, header6
sample1, property1,,,average1,average2
,,detail1,detail2,,
,,detail1,detail2,,
,,detail1,detail2,,
sample2, ...
,,detail1,detail2,,
,,detail1,detail2,,
...
(i.e. line 0 is the header, line 1 is record 1, lines 2 through n are details, line n+1 is record 2 and so on...)
What is the best way to extricate (renormalize?) the details into separate DataFrames that can be referenced using values in the sample# records? The number of detail rows is different for each sample.
I can use:
samplelist = df.header2[pd.notnull(df.header2)]
to get the starting index of each sample so that I can grab samplelist.index[0] to samplelist.index[1] and put it in a smaller dataframe. Detail records by themselves have no reference to which sample they came from, so that has to be inferred from the order of the csv file (notice that there is no intersection of filled/empty fields in my example).
Should I make a list of dataframes, a dict of dataframes, or a panel of dataframes?
Can I somehow create variables from the sample1 record fields and somehow attach them to each dataframe that has only detail records (like a collection of objects that have several scalar members and one dataframe each)?
Eventually I will create statistics on data from each detail record grouping and plot them against values in the sample records (e.g. sampletype, day or date, etc. vs. mystatistic). I will create intermediate Series to also be attached to the sample grouping like a kernel density estimation PDF or histogram.
Thanks.
You can use the fact that the first column seems to be empty unless it's a new sample record to .fillna(method='ffill') and then .groupby('header1') to get all the separate groups. On these, you can calculate statistics right away or store as separate DataFrame. High level sketch as follows:
df.header1 = df.header1.fillna(method='ffill')

for sample, data in df.groupby('header1'):
    print(sample)  # access to the sample name
    data = ...     # process the sample's detail records
The answer above got me going in the right direction. With further work, I ended up with the following. It turns out I needed to use two columns as a compound key to uniquely identify samples.
df.header1 = df.header1.fillna(method='ffill')
df.header2 = df.header2.fillna(method='ffill')

grouped = df.groupby(['header1', 'header2'])
samplelist = []
dfParent = pd.DataFrame()
dfDetail = pd.DataFrame()
for sample, data in grouped:
    samplelist.append(sample)
    dfParent = dfParent.append(grouped.get_group(sample).head(n=1), ignore_index=True)
    dfDetail = dfDetail.append(data[1:], ignore_index=True)

# remove columns only used in detail records
dfParent = dfParent.drop(['header3', 'header4', etc...], axis=1)
# remove columns only used once per sample
dfDetail = dfDetail.drop(['header5', 'header6', etc...], axis=1)

# Now details can be extracted by sample number in the sample list
# (e.g. the first 10 for sample 0)
samplenumber = 0
dfDetail[
    (dfDetail['header1'] == samplelist[samplenumber][0]) &
    (dfDetail['header2'] == samplelist[samplenumber][1])
].header3[:10]
Useful links were:
Pandas groupby and get_group
Pandas append to DataFrame
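As a variation on the above (a sketch only, assuming df has already been forward-filled as shown), the parent rows and the detail rows can also be kept as one parent DataFrame plus a dict of detail DataFrames keyed by the compound sample key, which avoids the repeated append calls (DataFrame.append was removed in pandas 2.0):
grouped = df.groupby(['header1', 'header2'])

# One row per sample: the first row of each group holds the parent fields.
dfParent = grouped.head(1).reset_index(drop=True)

# Detail rows per sample, keyed by the (header1, header2) compound key.
details = {key: group.iloc[1:] for key, group in grouped}

# e.g. the first 10 detail rows of one sample:
some_key = next(iter(details))
print(details[some_key].head(10))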
