Create a new column with the value found in another DataFrame - python

I have two DataFrames:
df_components: list of unique components (ID, DESCRIPTION)
dataset: several rows and columns from a CSV (one of these columns contains the description of a component).
I need to create a new column in dataset with the ID of the component, looked up from df_components.
I tried it this way:
Creating the df_components and the ID column based on the index
components = dataset["COMPDESC"].unique()
df_components = pd.DataFrame(components, columns=['DESCRIPTION'])
df_components.sort_values(by='DESCRIPTION', ascending=True, inplace=True)
df_components.reset_index(drop=True, inplace=True)
df_components.index += 1
df_components['ID'] = df_components.index
Sample output:
DESCRIPTION ID
1 AIR BAGS 1
2 AIR BAGS:FRONTAL 2
3 AIR BAGS:FRONTAL:SENSOR/CONTROL MODULE 3
4 AIR BAGS:SIDE/WINDOW 4
Create the COMP_ID in the dataset:
def create_component_id_column(row):
    found = df_components[df_components['DESCRIPTION'] == row['COMPDESC']]
    return found.ID if len(found.index) > 0 else None
dataset['COMP_ID'] = dataset.apply(lambda row: create_component_id_column(row), axis=1)
However, this gives me the error ValueError: Wrong number of items passed 248, placement implies 1, 248 being the number of items in df_components.
How can I create this new column with the ID from the item found on df_components?

Your logic seems overcomplicated. Since you are currently creating df_components from dataset, a better idea would be to use Categorical Data with dataset. This means you do not need to create df_components.
Step 1
Convert dataset['COMPDESC'] to categorical.
dataset['COMPDESC'] = dataset['COMPDESC'].astype('category')
Step 2
Create ID from categorical codes. Since categories are alphabetically sorted by default and indexing starts from 0, add 1 to the codes.
dataset['ID'] = dataset['COMPDESC'].cat.codes + 1
If you wish, you can extract the entire categorical mapping to a dictionary:
cat_map = dict(enumerate(dataset['COMPDESC'].cat.categories))
Remember that there will always be a 1-offset if you want your IDs to begin at 1. In addition, you will need to update 'ID' explicitly every time the values in 'COMPDESC' change.
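Here is a minimal, self-contained sketch of the whole approach on made-up data (the column name COMPDESC matches the question; the sample values are invented):
import pandas as pd

# hypothetical stand-in for the CSV-backed dataset
dataset = pd.DataFrame({'COMPDESC': ['AIR BAGS:FRONTAL', 'AIR BAGS', 'AIR BAGS:SIDE/WINDOW', 'AIR BAGS']})

# Step 1: convert to categorical; categories are sorted alphabetically by default
dataset['COMPDESC'] = dataset['COMPDESC'].astype('category')

# Step 2: codes start at 0, so add 1 to get IDs starting at 1
dataset['ID'] = dataset['COMPDESC'].cat.codes + 1

# optional lookup table equivalent to df_components
cat_map = dict(enumerate(dataset['COMPDESC'].cat.categories))
print(cat_map)  # {0: 'AIR BAGS', 1: 'AIR BAGS:FRONTAL', 2: 'AIR BAGS:SIDE/WINDOW'}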
Advantages of using categorical data
Memory efficient: strings are only stored once.
Structure: you define the categories and have an automatic layer of data validation.
Consistent: since category to code mappings are always 1-to-1, they will always be consistent, even when new categories are added.

Related

How to extract a set of rows from one column of a dataframe using a variable column header?

I have a dataframe of multiple columns: the first is 'qty' and contains several (usually 4) replicates for several different quantities. The remaining columns represent the result of a test for the corresponding replicate at the corresponding quantity -- either a numeric value or a string ('TND' for target not detected, 'Ind' for indeterminate, etc.). Each of the columns (other than the first) represent the results for given 'targets', and there can be any number of targets in a given dataset. An example might be
qty target1 target2
1 TND TND
1 724 TND
1 TND TND
1 674 TND
5 1.4E+04 TND
5 9.2E+03 194
5 1.1E+04 TND
5 9.9E+03 TND
The ultimate goal is to get the probability of detecting each target at each concentration/quantity, so I initially calculated this using the function
def hitrate(qty, df):
    t_s = df[df.qty == qty].result
    t_s = t_s.apply(pd.to_numeric, args=('coerce',)).isna()
    return (len(t_s) - t_s.sum()) / len(t_s)
but this was when I only needed to evaluate probabilities for a single target. Before calling hitrate, I'd just ask the user what the header for their target column was, assign it to the variable tar, and use df = df.rename(columns={tar:'result'}).
Now that there are multiple targets, I can't use the hitrate function I wrote, as I need to call it in a loop such as
qtys = df['qty'].unique()
probs = np.zeros([len(qtys), len(targets)])
for i, tar in enumerate(targets):
    for idx, val in enumerate(qtys):
        probs[idx, i] = hitrate(val, data)
But the hitrate function explicitly pulls the result/target column for a given quantity by using df[df.qty == qty].result. This no longer works, since the target column changes, and trying to use something like df[df.qty == qty].targets[i] or df[df.qty == qty].tar throws an error, presumably because you can't reference a dataframe column with a variable containing the column name (like you can with the column name directly, i.e. df.result).
In the end, I need to end up with two arrays or dataframes such as (with the above example table as reference):
Table for target_1:
qty probability
1 0.5
5 1.0
Table for target_2:
qty probability
1 0.0
5 0.25
I'm sorry if the question is confusing... If so, leave a comment and I'll try to be a bit clearer. It's been a long day. Any help would be appreciated!
The most basic way of accessing a column from a DataFrame is to use square brackets (like a dict):
df['some_column']
Attribute indexing is nice, but it doesn't work in many cases (column names with spaces, for example).
So, try something like:
target = 'target1'
...
df[df.qty == qty][target]
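Applied to the hitrate function from the question, a minimal sketch (the column names follow the question; the sample frame and the extra target parameter are assumptions) might look like this:
import numpy as np
import pandas as pd

def hitrate(qty, df, target):
    # select the target column by name with square brackets
    t_s = df.loc[df['qty'] == qty, target]
    detected = pd.to_numeric(t_s, errors='coerce').notna()
    return detected.sum() / len(detected)

# hypothetical data mirroring the example table in the question
df = pd.DataFrame({
    'qty':     [1, 1, 1, 1, 5, 5, 5, 5],
    'target1': ['TND', '724', 'TND', '674', '1.4E+04', '9.2E+03', '1.1E+04', '9.9E+03'],
    'target2': ['TND', 'TND', 'TND', 'TND', 'TND', '194', 'TND', 'TND'],
})

targets = ['target1', 'target2']
qtys = df['qty'].unique()
probs = np.zeros([len(qtys), len(targets)])
for i, tar in enumerate(targets):
    for idx, val in enumerate(qtys):
        probs[idx, i] = hitrate(val, df, tar)
# probs[:, 0] -> [0.5, 1.0]; probs[:, 1] -> [0.0, 0.25]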

Python: One hot encoding using reference list

I have data something like below:
CANDIDATE_ID  Job1_Skill1
12            conflict management
13            asset management
I want to add one-hot encoded columns for each skill in the table, using python and pandas, based on a reference skill set (list).
For example, if the given reference skill set is
[conflict management, asset management, .net]
then my output should be something like below:
CANDIDATE_ID  Job1_Skill1          FP_conflict management  FP_asset management  FP_.net
12            conflict management  1                       0                    0
13            asset management     0                       1                    0
I could do it by comparing row by row, but that does not seem to be an efficient approach. Can anyone suggest an efficient way to do this using python?
get_dummies gives output based on the values present in the column, but I need to encode against a specific reference list, i.e. get_dummies would produce encodings only for FP_conflict management and FP_asset management, not for FP_.net. Its output is also dynamic for each dataframe, whereas I need to encode based on the same specific list of skills for every dataframe.
Here is a simple workaround: add the reference list to the source dataframe before encoding.
import pandas as pd

# set up source data
df_data = pd.DataFrame([[12, 'conflict'], [13, 'asset']], columns=['CANDIDATE_ID', 'Job1_Skill1'])
# define reference list with some unique ids
skills = [[999, 'conflict'], [999, 'asset'], [999, '.net']]
df_skills = pd.DataFrame(skills, columns=['CANDIDATE_ID', 'Job1_Skill1'])
# add reference data to main df (pd.concat, since DataFrame.append is removed in pandas 2.x)
df_data_with_skills = pd.concat([df_data, df_skills], ignore_index=True)
# encode with pd.get_dummies
skills_dummies = pd.get_dummies(df_data_with_skills.Job1_Skill1)
result = pd.concat([df_data_with_skills, skills_dummies], axis=1)
# remove reference rows
result.drop(result[result['CANDIDATE_ID'] == 999].index, inplace=True)
print(result)
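A variant that avoids appending and dropping placeholder rows (a sketch, assuming the short skill names used in the answer above and the FP_ prefix from the question) is to make the column categorical with the full reference list, so get_dummies emits a column for every reference skill even if it never occurs:
import pandas as pd

df_data = pd.DataFrame([[12, 'conflict'], [13, 'asset']], columns=['CANDIDATE_ID', 'Job1_Skill1'])
reference_skills = ['conflict', 'asset', '.net']

# a categorical dtype with explicit categories keeps unseen skills as all-zero columns
skill_col = df_data['Job1_Skill1'].astype(pd.CategoricalDtype(categories=reference_skills))
dummies = pd.get_dummies(skill_col, prefix='FP')
result = pd.concat([df_data, dummies], axis=1)
print(result)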

how to divide a pandas dataframe into different dataframes based on unique values from one column and iterate over them?

I have a dataframe with three columns.
The first column has 3 unique values. I used the code below to create a separate dataframe for each unique value, however I am unable to iterate over those dataframes programmatically and am not sure how to do so.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:, 0].unique())  # let's assume the unique values are 0, 1, 2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:, 0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
For example, let's say I want to find out the length of the first unique dataframe. If I manually type the name of the DF, I get the correct output:
len(df0)
O/P
35
But I am trying to automate the code, so I want to find the length and iterate over each dataframe programmatically, just as I would by typing its name.
What I'm looking for is: if I try the code below,
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, where the key is the unique group and the value is the corresponding rows, but I can't figure out how to iterate over the dictionary when the DF has more than two columns.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Lets assume there are 100 rows of data
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that Group A tickets should go to one group; for each description a unique task has to be created, but we can club 10 tasks together and submit them as one request. So if I divide the df into different dfs based on Assignment_group, it would be easier to iterate over (that's the only idea I could think of).
For example lets say we have REQUEST001
within that request it will have multiple sub tasks such as STASK001,STASK002 ... STASK010.
hope this helps
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on the remaining columns, like getting the mean value of price for each group if that were a column.
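If you also need to walk each group row by row, for example to batch descriptions into requests of 10 as described in the update, a minimal sketch (assuming the sample columns Assignment_group and Description and the input.xlsx file from the question) could be:
import pandas as pd

df = pd.read_excel("input.xlsx")

for group_name, group_df in df.groupby('Assignment_group'):
    print(group_name, len(group_df))  # size of each per-group dataframe
    # split the group's rows into batches of at most 10 descriptions
    for start in range(0, len(group_df), 10):
        batch = group_df.iloc[start:start + 10]
        for description in batch['Description']:
            pass  # e.g. create one sub-task per description, one request per batch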
I think you want to try something like len(eval('df%s' % 0))

Return String Similarity Scores between two String Columns - Pandas

I'm trying to build search-based results, wherein I will have an input dataframe with one row that I want to compare against another dataframe with almost 1 million rows. I'm using a package called Record Linkage.
However, I'm not able to handle typos. Let's say I have "HSBC" in my original data and the user types it as "HKSBC"; I want to return "HSBC" results only. On comparing the string similarity distance with jarowinkler I get the following results:
from pyjarowinkler import distance
distance.get_jaro_distance("hksbc", "hsbc", winkler=True, scaling=0.1)
>> 0.94
However, I'm not able to give "HSBC" as an output, so I want to create a new column in my pandas dataframe wherein I'll compute the string similarity scores and keep the rows whose score is above a particular threshold.
Also, the main bottleneck is that I have almost 1 million rows, so I need to compute it really fast.
P.S. I have no intention of using fuzzywuzzy; preferably either Jaccard or Jaro-Winkler.
P.P.S. Any other ideas for handling typos in a search-based setting are also acceptable.
I was able to solve it through Record Linkage only. Basically, it does an initial indexing and generates candidate links (you can refer to the documentation on "SortedNeighbourhoodindexing" for more info), i.e. it builds a multi-index between the two dataframes that need to be compared, which I did manually.
So here is my code:
import recordlinkage
df['index'] = 1 # this will be static since I'll have only one input value
df['index_2'] = range(1, len(df)+1)
df.set_index(['index', 'index_2'], inplace=True)
candidate_links=df.index
df.reset_index(drop=True, inplace=True)
df.index = range(1, len(df)+1)
# once the candidate links has been generated you need to reset the index and compare with the input dataframe which basically has only one static index, i.e. 1
compare_cl = recordlinkage.Compare()
compare_cl.string('Name', 'Name', label='Name', method='jarowinkler') # 'Name' is the column name which is there in both the dataframe
features = compare_cl.compute(candidate_links,df_input,df) # df_input is the i/p df having only one index value since it will always have only one row
print(features)
Name
index index_2
1 13446 0.494444
13447 0.420833
13469 0.517949
Now I can give a filter like this:
features = features[features['Name'] > 0.9] # setting the threshold which will filter away my not-so-close names.
Then,
df = df[df.index.isin(features.index.get_level_values('index_2'))]  # keep only the rows whose index_2 passed the threshold
This will filter my results and give me the final dataframe containing only the names whose score is greater than the threshold set by the user.
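For the simpler variant described in the question, namely adding a similarity-score column and keeping rows above a threshold, a minimal sketch using the pyjarowinkler call from the question (the Name column, the sample data, and the 0.9 threshold are assumptions) would be:
import pandas as pd
from pyjarowinkler import distance

df = pd.DataFrame({'Name': ['HSBC', 'Barclays', 'HSB Corp']})  # hypothetical data
query = 'HKSBC'  # the (possibly misspelled) user input

# compute a Jaro-Winkler score for every row against the query
df['score'] = df['Name'].apply(
    lambda name: distance.get_jaro_distance(query.lower(), name.lower(),
                                            winkler=True, scaling=0.1))

matches = df[df['score'] > 0.9].sort_values('score', ascending=False)
print(matches)
Note that a plain .apply over a million rows is a Python-level loop and may be slow; the candidate-link indexing above exists precisely to cut down the number of comparisons.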

Converting a column of credit ratings like AAA BB CC to a numeric category of AAA = 1, BB = .75 etc in python?

I have a column in a dataframe called 'CREDIT RATING' for a number of companies across rows. I need to assign a numerical category to ratings from AAA to DDD, from 1 (AAA) to 0 (DDD). Is there a quick, simple way to do this and basically create a new column where I get numbers from 1 to 0 in steps of 0.1? Thanks!
You could use replace:
df['CREDIT RATING NUMERIC'] = df['CREDIT RATING'].replace({'AAA':1, ... , 'DDD':0})
The easiest way is to simply create a dictionary mapping:
mymap = {"AAA":1.0, "AA":0.9, ... "DDD":0.0}
and then apply it to the dataframe:
df["CREDIT MAPPING"] = df["CREDIT RATING"].replace(mymap)
Ok, this was kinda tough with nothing to work with, but here we go:
import numpy as np
import pandas as pd

# First, get a ratings list (acquired from Wikipedia), then set it into a dataframe to replicate your scenario
ratings = ['AAA' ,'AA1' ,'AA2' ,'AA3' ,'A1' ,'A2' ,'A3' ,'BAA1' ,'BAA2' ,'BAA3' ,'BA1' ,'BA2' ,'BA3' ,'B1' ,'B2' ,'B3' ,'CAA' ,'CA' ,'C' ,'C' ,'E' ,'WR' ,'UNSO' ,'SD' ,'NR']
df_credit_ratings = pd.DataFrame({'Ratings_id':ratings})
df_credit_ratings = pd.concat([df_credit_ratings,df_credit_ratings]) # just to replicate duplicate records
# The set() command get the unique values
unique_ratings = set(df_credit_ratings['Ratings_id'])
number_of_ratings = len(unique_ratings) # counting how many unique there are
number_of_ratings_by_tenth = number_of_ratings/10 # because from 0 to 1 in steps of 0.1 there are 10 positions
# numpy's arange fills in values over a range (first two numbers), stepping by the third number
dec = list(np.arange(0.0, number_of_ratings_by_tenth, 0.1))
After this you'll need to match the unique ratings to their weights:
df_ratings_unique = pd.DataFrame({'Ratings_id':list(unique_ratings)}) # list so it gets one value per row
EDIT: as Thomas suggested in another answer's comment, this sort probably won't fit you because it won't reflect the real order of importance of the ratings. So you'll probably need to first create a dataframe with them already in order, with no need to sort.
df_ratings_unique.sort_values(by='Ratings_id', ascending=True, inplace=True) # sorting so it matches the order of our weights above
Resuming the solution:
df_ratings_unique['Weigth'] = dec # adding the weights to the DF
df_ratings_unique.set_index('Ratings_id', inplace=True) # setting the ratings as index to map the values below
# now this is the magic: we create a new column on the original dataframe and map it from our unique dataframe according to 'Ratings_id'
df_credit_ratings['Weigth'] = df_credit_ratings['Ratings_id'].map(df_ratings_unique.Weigth)
