How to do fuzzy matching on an Excel file using Pandas?

I have a table called account with two columns, ID and NAME. ID is a unique hash, but NAME is a string which might have duplicates.
I'm trying to write a Python script that reads this Excel file and finds up to three similar NAME values for each row, but I cannot get it to work.
Could someone help out? Thanks
import pandas as pd
from fuzzywuzzy import fuzz
import difflib
def get_spr(row):
    d = name1.apply(lambda x: (fuzz.ratio(x['NAME'], row['NAME']) * 0 if row['ID'] == x['ID'] else 1), axis=1)
    d = d[d>= 60]
    if len(d) == 0:
        v = ['']*2
    else:
        v = name1.ix[d.idxmax(),['ID' , 'NAME']].values
    return pd.Series(v, index=['ID', 'NAME'])
def score(tablerow):
    d = name1.apply(lambda x: fuzz.ratio(x['NAME'],tablerow['NAME']) * (0 if x['ID']==tablerow['ID'] else 1), axis=1)
    d = d[d>90]
    if len(d) == 0:
        v = [''] * 2
    else:
        v = name1.ix[d.order(ascending=False).head(3).index, ['ID' , 'NAME']].values
    return pd.DataFrame(v, index=['ID', 'NAME'])
account = "account_test.xlsx"
xl_acc1 = pd.ExcelFile(account)
xl_acc2 = pd.ExcelFile(account)
acc1 = xl_acc1.parse(xl_acc1.sheet_names[0])
acc2 = xl_acc2.parse(xl_acc2.sheet_names[0])
name1 = acc1[pd.notnull(acc1['NAME'])]
name2 = acc2[pd.notnull(acc2['NAME'])]
print 'Doing Fuzzy Matching'
name2= pd.concat((name2,name2.apply(get_spr, axis=1)), axis=1)
name2.to_excel(pd.ExcelWriter('res.xlsx'),'acc')
Any help would be much appreciated!
The file has rows like this:-
ID NAME
0016F00001c7GDZQA2 Daniela Abriani
0016F00001c7GPnQAM Daniel Abriani
0016F00001c7JRrQAM Nisha Well
0016F00001c7Jv8QAE Katherine
0016F00001c7cXiQAI Katerine
0016F00001c7dA3QAI Katherin
0016F00001c7kHyQAI Nursing and Midwifery Council Research Office
0016F00001c8G8OQAU Nisa Well
Expected (output dataframe) would be something like:
ID NAME ID2 NAME2
<hash1> katherine <hash2> katerine
<hash1> katherine <hash3> katherin
<hash4> Nisa Well <hash5> Nisha Well
Issue: The above code just reproduces the input as the output saved file without actually concatenating any matches.

I don't think you need to do this in pandas. Here is a rough solution that gets your desired output using a dictionary.
import pandas as pd
from fuzzywuzzy import process
df = pd.DataFrame([
['0016F00001c7GDZQA2', 'Daniela Abriani'],
['0016F00001c7GPnQAM', 'Daniel Abriani'],
['0016F00001c7JRrQAM', 'Nisha Well'],
['0016F00001c7Jv8QAE', 'Katherine'],
['0016F00001c7cXiQAI', 'Katerine'],
['0016F00001c7dA3QAI', 'Katherin'],
['0016F00001c7kHyQAI', 'Nursing and Midwifery Council Research Office'],
['0016F00001c8G8OQAU', 'Nisa Well']],
columns=['ID', 'NAME'])
Get the unique hashes into a dictionary:
hashdict = dict(zip(df['ID'], df['NAME']))
Define a function checkpair. You'll need it to remove reciprocal hash pairs. The matching loop adds both (hash1, hash2) and (hash2, hash1), but I think you only want to keep one of those pairs:
def checkpair(a, b, l):
    # remove the entry that is the mirror image of the pair (a, b)
    for x in l:
        if (a,b) == (x[2],x[0]):
            l.remove(x)
Now iterate through hashdict.items(), finding the top 3 matches along the way. The fuzzywuzzy docs detail the process module.
matches = []
for k, v in hashdict.items():
    # see docs for extract -- limit is 4 because each name is also compared to itself
    top3 = process.extract(v, hashdict, limit=4)
    # drop the entry where the hash ID is compared to itself
    top3 = [h for h in top3 if h[2] != k]
    # append tuples to the list "matches" if they meet the score criterion
    matches.extend((k, v, x[2], x[0], x[1]) for x in top3 if x[1] > 60)  # change score?
# remove reciprocal pairs
for m in matches:
    checkpair(m[0], m[2], matches)
df = pd.DataFrame(matches, columns=['id1', 'name1', 'id2', 'name2', 'score'])
# write to file (ExcelWriter.save() no longer exists in recent pandas, so pass the path directly)
df.to_excel('/path/to/your/file.xlsx', sheet_name='Sheet1')
Output:
id1 name1 id2 name2 score
0 0016F00001c7JRrQAM Nisha Well 0016F00001c8G8OQAU Nisa Well 95
1 0016F00001c7GPnQAM Daniel Abriani 0016F00001c7GDZQA2 Daniela Abriani 97
2 0016F00001c7Jv8QAE Katherine 0016F00001c7dA3QAI Katherin 94
3 0016F00001c7Jv8QAE Katherine 0016F00001c7cXiQAI Katerine 94
4 0016F00001c7dA3QAI Katherin 0016F00001c7cXiQAI Katerine 88
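The reciprocal-pair cleanup can also be avoided by comparing each unordered pair of rows only once. The sketch below is one way to do that, not the method used in the answer: it swaps fuzzywuzzy for the standard library's difflib.SequenceMatcher (whose ratio is on a 0 to 1 scale rather than 0 to 100), and the function name similar_pairs and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

df = pd.DataFrame([
    ['0016F00001c7GDZQA2', 'Daniela Abriani'],
    ['0016F00001c7GPnQAM', 'Daniel Abriani'],
    ['0016F00001c7JRrQAM', 'Nisha Well'],
    ['0016F00001c7Jv8QAE', 'Katherine'],
    ['0016F00001c7cXiQAI', 'Katerine'],
    ['0016F00001c7dA3QAI', 'Katherin'],
    ['0016F00001c7kHyQAI', 'Nursing and Midwifery Council Research Office'],
    ['0016F00001c8G8OQAU', 'Nisa Well']],
    columns=['ID', 'NAME'])


def similar_pairs(frame, threshold=0.85):
    """Return one row per unordered pair of names with similarity >= threshold."""
    records = list(frame[['ID', 'NAME']].itertuples(index=False, name=None))
    rows = []
    # combinations() yields each pair once, so (a, b) and (b, a) never both appear
    for (id1, name1), (id2, name2) in combinations(records, 2):
        score = SequenceMatcher(None, name1, name2).ratio()
        if score >= threshold:
            rows.append((id1, name1, id2, name2, round(score, 3)))
    return pd.DataFrame(rows, columns=['id1', 'name1', 'id2', 'name2', 'score'])


pairs = similar_pairs(df)
print(pairs)
```

Comparing every pair is quadratic in the number of rows, so for a large account table it may be worth grouping names first (for example by first letter) before scoring.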

Related

Python script to sum values according to conditions in a loop

I need to sum the value contained in a column (column 9) if a condition is satisfied: the condition is that it needs to be a pair of individuals (column 1 and column 3), whether they are repeated or not.
My input file is made this way:
Sindhi_HGDP00171 0 Tunisian_39T 0 1 120437718 147097266 3.02 7.111
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 3.468
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 4.468
IBS_HG01768 2 Moroccan_MRA46 1 1 34186193 36027711 30.46 3.108
IBS_HG01710 1 Sardinian_HGDP01065 2 1 246117191 249120684 7.53 3.258
IBS_HG01768 2 Moroccan_MRA46 2 1 34186193 37320967 43.4 4.418
For instance, I need the values of column 9 to be summed for each pair. Some pairs appear multiple times: in this case I need the sum of the column 9 values between IBS_HG01768 and Moroccan_MRA46, and the sum between Sindhi_HGDP00183 and Sindhi_HGDP00206. Other pairs are not repeated, but I still need them to appear in the final results.
What I have managed so far is to sum by group (population), so I sum the column 9 values by pair of populations, such as Sindhi and Tunisian. I need to do the sum by pairs of individuals instead.
My script is this:
import pandas as pd
import numpy as np
import itertools
# defines columns names
cols = ['ID1', 'HAP1', 'ID2', 'HAP2', 'CHR', 'STARTPOS', 'ENDPOS', 'LOD', 'IBDLENGTH']
# loads data (the file needs to be in the same folder where the script is)
data = pd.read_csv("./Roma_Ref_All_sorted.txt", sep = '\t', names = cols)
# removes the sample ID for ID1/ID2 columns and places it in two dedicated columns
data[['ID1', 'ID1_samples']] = data['ID1'].str.split('_', expand = True)
data[['ID2', 'ID2_samples']] = data['ID2'].str.split('_', expand = True)
# gets the groups list from both ID columns...
groups_id1 = list(data.ID1.unique())
groups_id2 = list(data.ID2.unique())
groups = list(set(groups_id1 + groups_id2))
# ... and all the possible pairs
group_pairs = [i for i in itertools.combinations(groups, 2)]
# subsets the pairs having Roma
group_pairs_roma = [x for x in group_pairs if ('Roma' in x[0] and x[0] != 'Romanian') or
                    ('Roma' in x[1] and x[1] != 'Romanian')]
# prepares output df
result = pd.DataFrame(columns = ['ID1', 'ID2', 'IBD_sum'])
# loops all the possible pairs and computes the sum of IBD length
for idx, group_pair in enumerate(group_pairs_roma):
    id1 = group_pair[0]
    id2 = group_pair[1]
    ibd_sum = round(data.loc[((data['ID1'] == id1) & (data['ID2'] == id2)) |
                             ((data['ID1'] == id2) & (data['ID2'] == id1)), 'IBDLENGTH'].sum(),3)
    result.loc [idx, ['ID1', 'ID2', 'IBD_sum']] = [id1, id2, ibd_sum]
# saves results
result.to_csv("./groups_pairs_sum_IBD.txt", sep = '\t', index = False)
My current output is something like this:
ID1 ID2 IBD_sum
Sindhi IBS 3.275
Sindhi Moroccan 74.201
Sindhi Sindhi 119.359
While I need something like:
ID1 ID2 IBD_sum
Sindhi_individual1 Moroccan_individual1 3.275
Sindhi_individual2 Moroccan_individual2 5.275
Sindhi_individual3 IBS_individual1 4.275
I have tried by substituting one line in my code, by writing
groups_id1 = list(data.ID1_samples.unique())
groups_id2 = list(data.ID2_samples.unique())
and later
ibd_sum = round(data.loc[((data['ID1_samples'] == id1) & (data['ID2_samples'] == id2)) |
((data['ID1_samples'] == id2) & (data['ID2_samples'] == id1)), 'IBDLENGTH'].sum(),3)
Which in theory should work because I set the individuals as pairs instead of populations as pairs, but the output was empty. What could I do to edit the code for what I need?

I have solved the problem on my own, but using R.
This is the code:
library(dplyr)
ibd <- read.delim("input.txt", sep='\t')
ibd_sum_indv <- ibd %>%
group_by(ID1, ID2) %>%
summarise(SIBD = sum(IBDLENGTH),
NIBD = n()) %>%
ungroup()
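For readers who want to stay in Python, the same group-and-sum can be sketched with pandas groupby. This is an illustrative equivalent rather than the poster's code: it reads the six sample rows from a string instead of the real file, and the helper columns IND1 and IND2 are names assumed here. Each pair is sorted first so that (A, B) and (B, A) fall into the same group, matching the symmetric condition in the original loop.

```python
import io

import pandas as pd

cols = ['ID1', 'HAP1', 'ID2', 'HAP2', 'CHR', 'STARTPOS', 'ENDPOS', 'LOD', 'IBDLENGTH']
raw = """Sindhi_HGDP00171 0 Tunisian_39T 0 1 120437718 147097266 3.02 7.111
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 3.468
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 4.468
IBS_HG01768 2 Moroccan_MRA46 1 1 34186193 36027711 30.46 3.108
IBS_HG01710 1 Sardinian_HGDP01065 2 1 246117191 249120684 7.53 3.258
IBS_HG01768 2 Moroccan_MRA46 2 1 34186193 37320967 43.4 4.418
"""
data = pd.read_csv(io.StringIO(raw), sep=r'\s+', names=cols)

# Order the two individuals alphabetically so the pair is direction-independent.
ordered = [tuple(sorted(pair)) for pair in zip(data['ID1'], data['ID2'])]
data['IND1'] = [pair[0] for pair in ordered]
data['IND2'] = [pair[1] for pair in ordered]

# One row per pair of individuals, including pairs that occur only once.
result = (data.groupby(['IND1', 'IND2'], as_index=False)['IBDLENGTH']
              .sum()
              .rename(columns={'IBDLENGTH': 'IBD_sum'}))
result['IBD_sum'] = result['IBD_sum'].round(3)
print(result)
```

Because the full individual IDs are kept (the population prefix is not split off), no list of candidate pairs has to be generated up front.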

Fill cell values by index and column in pandas from external list

I am trying to populate a dataframe by looking up values in a list of lists and finding a match for each column/index combination. I found the post "fill in entire dataframe cell by cell based on index AND column names?" and thought I could modify it for my needs. Its author fills in a pandas dataframe using the edit_distance function. I am currently trying to modify that function so it outputs actual data.
My data set looks something like this but with many more values:
Data = [Product Number, Date, Quantity]
[X1, 2018-01, 2]
[X1, 2018-02, 4]
[X1, 2018-03, 7]
[X2, 2018-01, 3]
[X3, 2018-02, 5]
[X3, 2018-03, 6]
Expected outcome (apologies for the crude representation):
DF =  2018-01  2018-02  2018-03
X1    2        4        7
X2    3
X3             5        6
I deduped all of the product numbers and dates in my list of lists and set them equal to the below, just as he did in the referenced stack question.
series_rows = pd.Series(prod_deduped)
series_cols = pd.Series(dates_deduped)
His code for mapping all of the cells:
df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: edit_distance(x, y))))
The part starting with edit_distance is a function that returns a value based on the inputs for x,y. I created my own function that would loop through a list of lists and return a value based on a match.
def return_value(s1, s2, list_of_lists, starting_point_in_case_of_header):
    for row in list_of_lists[starting_point_in_case_of_header:]:
        result = ''
        product = row[0]
        date = row[1]
        quantity = row[2]
        #for prod in product:
        if product == s1 and date == s2:
            result = quantity
        return result
I get a match on row1, column1 but everything else is blank which makes me think that I really need to be looping through s1 or s2 before everything else. Any help would be appreciated. Thanks!
EDIT: Here is my most recent attempt at trying to loop through s1 and s2 but this just errors out saying my list index is out of range. I think I am on the right track though.
def return_value(s1, s2, list_of_lists, starting_point_in_case_of_header):
for y in enumerate(s2):
result_ = []
for x in enumerate(s1):
for row in list_of_lists[starting_point_in_case_of_header:]:
product = row[0]
date = row[1]
quantity = row[2]
if product == x and date == y:
result_.append(quantity)
result = result_
return result[-1]
My final code to put everything all together:
result_df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: return_value(x, y, sorted_deduped_list, 0))))
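An alternative that avoids the cell-by-cell lookup altogether is to load the list of lists into a long-format DataFrame and reshape it. The following is a minimal sketch under that assumption, not a fix of the function above; the column names product, date and quantity are made up for the example.

```python
import pandas as pd

rows = [
    ['X1', '2018-01', 2],
    ['X1', '2018-02', 4],
    ['X1', '2018-03', 7],
    ['X2', '2018-01', 3],
    ['X3', '2018-02', 5],
    ['X3', '2018-03', 6],
]

# Long format: one row per (product, date) observation.
long_df = pd.DataFrame(rows, columns=['product', 'date', 'quantity'])

# Wide format: products as the index, dates as columns, NaN where no row exists.
wide = long_df.pivot(index='product', columns='date', values='quantity')
print(wide)
```

pivot raises an error if the same (product, date) combination occurs more than once; in that case pivot_table with an aggregation function such as sum is the usual substitute.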

Feature extraction from the training data

I have training data like the example below, with all the information in a single column. The data set has more than 300,000 rows.
id features label
1 name=John Matthew;age=25;1.=Post Graduate;2.=Football Player; 1
2 name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games; 1
3 name=David;age=12;1:=High School;2:=Cricketer;native=america; 2
4 name=George;age=11;1:=High School;2:=Carpenter;married=yes 2
.
.
300000 name=Kevin;age=16;1:=High School;2:=Driver;Smoker=No 3
Now I need to convert this training data into the form below:
id name age 1 2 Interest married Smoker
1 John Matthew 25 Post Graduate Football Player Nan Nan Nan
2 Mark clark 21 Under Graduate Nan Video Games Nan Nan
.
.
Is there an efficient way to do this? I tried the code below, but it took 3 hours to complete:
#Getting the proper features from the features column
cols = {}
for choices in set_label:
    collection_list = []
    array = train["features"][train["label"] == choices].values
    for i in range(1,len(array)):
        var_split = array[i].split(";")
        try :
            d = (dict(s.split('=') for s in var_split))
            for x in d.keys():
                collection_list.append(x)
        except ValueError:
            Error = ValueError
    count = Counter(collection_list)
    for k , v in count.most_common(5):
        key = k.replace(":","").replace(" ","_").lower()
        cols[key] = v
columns_add = list(cols.keys())
train = train.reindex(columns = np.append( train.columns.values, columns_add))
print (train.columns)
print (train.shape)
#Adding the values for the newly created problem
for row in train.itertuples():
    dummy_dic = {}
    new_dict={}
    value = train.loc[row.Index, 'features']
    v_split = value.split(";")
    try :
        dummy_dict = (dict(s.split('=') for s in v_split))
        for k, v in dummy_dict.items():
            new_key = k.replace(":","").replace(" ","_").lower()
            new_dict[new_key] = v
    except ValueError:
        Error = ValueError
    for k,v in new_dict.items():
        if k in train.columns:
            train.loc[row.Index, k] = v
Is there a function I can apply here to make the feature extraction more efficient?

Create two DataFrames that meet your criteria: in the first, every data point has the same features; the second is a modification of the first that introduces different features for some data points:
import pandas as pd
import numpy as np
import random
import time
import itertools
# Create a DataFrame where all the keys for each datapoint in the "features" column are the same.
num = 300000
NAMES = ['John', 'Mark', 'David', 'George', 'Kevin']
AGES = [25, 21, 12, 11, 16]
FEATURES1 = ['Post Graduate', 'Under Graduate', 'High School']
FEATURES2 = ['Football Player', 'Cricketer', 'Carpenter', 'Driver']
LABELS = [1, 2, 3]
df = pd.DataFrame({'features': ["name={0};age={1};feature1={2};feature2={3}"
                   .format(NAMES[np.random.randint(0, len(NAMES))],
                           AGES[np.random.randint(0, len(AGES))],
                           FEATURES1[np.random.randint(0, len(FEATURES1))],
                           FEATURES2[np.random.randint(0, len(FEATURES2))]) for i in range(num)]})
df['label'] = [LABELS[np.random.randint(0, len(LABELS))] for i in range(num)]
print(df.head(20))
# Create a modified sample DataFrame from the previous one, where not all the keys are the same for each data point.
# copy, so that the changes below do not also alter df
mod_df = df.copy()
random_positions1 = random.sample(range(10), 5)
random_positions2 = random.sample(range(11, 20), 5)
INTERESTS = ['Basketball', 'Golf', 'Rugby']
SMOKING = ['Yes', 'No']
mod_df.loc[random_positions1, 'features'] = ["name={0};age={1};interest={2}"
    .format(NAMES[np.random.randint(0, len(NAMES))],
            AGES[np.random.randint(0, len(AGES))],
            INTERESTS[np.random.randint(0, len(INTERESTS))]) for i in range(len(random_positions1))]
mod_df.loc[random_positions2, 'features'] = ["name={0};age={1};smoking={2}"
    .format(NAMES[np.random.randint(0, len(NAMES))],
            AGES[np.random.randint(0, len(AGES))],
            SMOKING[np.random.randint(0, len(SMOKING))]) for i in range(len(random_positions2))]
print(mod_df.head(20))
Assume that your original data is stored in a DataFrame called df.
Solution 1 (all the features are the same for every data point).
def func2(y):
    # keep only the value part of a "key=value" item
    lista = y.split('=')
    value = lista[1]
    return value
def function(x):
    # split a row into its "key=value" items and return the list of values
    lista = x.split(';')
    array = [func2(i) for i in lista]
    return array
# Calculate the execution time
start = time.time()
array = pd.Series(df.features.apply(function)).tolist()
new_df = pd.DataFrame.from_records(array, columns=['name', 'age', '1', '2'])
end = time.time()
new_df
print('Total time:', end - start)
Total time: 1.80923295021
Edit: The one thing you need to do is edit the columns list accordingly.
Solution 2 (The features might be the same or different for every data point).
import pandas as pd
import numpy as np
import time
import itertools
# The following functions are meant to extract the keys from each row, which are going to be used as columns.
def extract_key(x):
    return x.split('=')[0]
def def_columns(x):
    lista = x.split(';')
    keys = [extract_key(i) for i in lista]
    return keys
df = mod_df
columns = pd.Series(df.features.apply(def_columns)).tolist()
flattened_columns = list(itertools.chain(*columns))
flattened_columns = np.unique(np.array(flattened_columns)).tolist()
flattened_columns
# This function turns each row from the original dataframe into a dictionary.
def function(x):
    lista = x.split(';')
    dict_ = {}
    for i in lista:
        key, val = i.split('=')
        dict_[key ] = val
    return dict_
df.features.apply(function)
arr = pd.Series(df.features.apply(function)).tolist()
pd.DataFrame.from_dict(arr)

Suppose your data is like this :
features= ["name=John Matthew;age=25;1:=Post Graduate;2:=Football Player;",
'name=Mark clark;age=21;1:=Under Graduate;2:=Football Player;',
"name=David;age=12;1:=High School;2:=Cricketer;",
"name=George;age=11;1:=High School;2:=Carpenter;",
'name=Kevin;age=16;1:=High School;2:=Driver; ']
df = pd.DataFrame({'features': features})
I will start by replacing all the key prefixes (name=, ;age=, ;1:=, ;2:=) with ; using this function:
def replace_feature(x):
    for r in (("name=", ";"), (";age=", ";"), (';1:=', ';'), (';2:=', ";")):
        x = x.replace(*r)
    x = x.split(';')
    return x
df = df.assign(features= df.features.apply(replace_feature))
After applying that function to your df, every value will be a list of features, from which you can get each one by index.
Then I use four custom functions to get each attribute (name, age, grade, job).
Note: there may be a better way to do this using only one function.
def get_name(df):
    return df['features'][1]
def get_age(df):
    return df['features'][2]
def get_grade(df):
    return df['features'][3]
def get_job(df):
    return df['features'][4]
And finally, apply those functions to your dataframe:
df = df.assign(name = df.apply(get_name, axis=1),
               age = df.apply(get_age, axis=1),
               grade = df.apply(get_grade, axis=1),
               job = df.apply(get_job, axis=1))
I hope this will be fast enough.

As far as I understand your code, the poor performance comes from the fact that you create the dataframe element by element. It's better to create the whole dataframe at once with a list of dictionaries.
Let's recreate your input dataframe:
from io import StringIO
import pandas as pd
data=StringIO("""id   features   label
1   name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;   1
2   name=Mark clark;age=21;1.=Under Graduate;2.=Football Player;   1
3   name=David;age=12;1:=High School;2:=Cricketer;   2
4   name=George;age=11;1:=High School;2:=Carpenter;   2""")
df=pd.read_table(data,sep=r'\s{3,}',engine='python')
We can check:
print(df)
id features label
0 1 name=John Matthew;age=25;1.=Post Graduate;2.=F... 1
1 2 name=Mark clark;age=21;1.=Under Graduate;2.=Fo... 1
2 3 name=David;age=12;1:=High School;2:=Cricketer; 2
3 4 name=George;age=11;1:=High School;2:=Carpenter; 2
Now we can create the needed list of dictionaries with the following code:
feat=[]
for line in df['features']:
    line=line.replace(':','.')
    lsp=line.split(';')[:-1]
    feat.append(dict([elt.split('=') for elt in lsp]))
And the resulting dataframe:
print(pd.DataFrame(feat))
1. 2. age name
0 Post Graduate Football Player 25 John Matthew
1 Under Graduate Football Player 21 Mark clark
2 High School Cricketer 12 David
3 High School Carpenter 11 George
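The same idea extends to rows whose keys differ, as in the question's sample where some rows carry Interest, native or married. The sketch below is an illustration under assumptions rather than a tested solution for the full 300,000-row file: parse_features is a hypothetical helper name, trailing ':' or '.' is stripped from keys so that "1." and "1:" land in one column, and items without '=' are skipped.

```python
import pandas as pd

features = [
    "name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;",
    "name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games;",
    "name=David;age=12;1:=High School;2:=Cricketer;native=america;",
    "name=George;age=11;1:=High School;2:=Carpenter;married=yes",
]


def parse_features(text):
    """Turn 'k1=v1;k2=v2;...' into a dict, normalising keys like '1.' and '1:'."""
    record = {}
    for item in text.split(';'):
        if '=' not in item:
            continue  # skip the empty piece left by a trailing ';'
        key, value = item.split('=', 1)
        record[key.strip().rstrip(':.')] = value.strip()
    return record


# Building the frame once from a list of dicts; missing keys become NaN.
parsed = pd.DataFrame([parse_features(text) for text in features])
print(parsed)
```

The DataFrame is built in one call, so the cost is a single pass over the strings instead of one .loc assignment per cell.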

Counting line frequencies and producing output files

With a textfile like this:
a;b
b;a
c;d
d;c
e;a
f;g
h;b
b;f
b;f
c;g
a;b
d;f
How can one read it, and produce two output text files: one keeping only the lines representing the most often occurring couple for each letter; and one keeping all the couples that include any of the top 25% of most commonly occurring letters.
Sorry for not sharing any code. Been trying lots of stuff with list comprehensions, counts, and pandas, but not fluent enough.

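Here is an answer without a frozen set.
# sort each row so that (a, b) and (b, a) count as the same couple
df1 = df.apply(lambda row: pd.Series(sorted(row), index=row.index), axis=1)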
df_count =df1.groupby(['A', 'B']).size().reset_index().sort_values(0, ascending=False)
df_count.columns = ['A', 'B', 'Count']
df_all = pd.concat([df_count.assign(letter=lambda x: x['A']),
df_count.assign(letter=lambda x: x['B'])]).sort_values(['letter', 'Count'], ascending =[True, False])
df_first = df_all.groupby(['letter']).first().reset_index()
top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]
------------older answer --------
Since order does not matter within a couple, you can use a frozenset as the key to a groupby
import pandas as pd
df = pd.read_csv('text.csv', header=None, names=['A','B'], sep=';')
s = df.apply(frozenset, 1)
df_count = s.value_counts().reset_index()
df_count.columns = ['Combos', 'Count']
Which will give you this
Combos Count
0 (a, b) 3
1 (b, f) 2
2 (d, c) 2
3 (g, f) 1
4 (b, h) 1
5 (c, g) 1
6 (d, f) 1
7 (e, a) 1
To get the highest combo for each letter we will concatenate this dataframe on top of itself and make another column that will hold either the first or second letter.
df_a = df_count.copy()
df_b = df_count.copy()
df_a['letter'] = df_a['Combos'].apply(lambda x: list(x)[0])
df_b['letter'] = df_b['Combos'].apply(lambda x: list(x)[1])
df_all = pd.concat([df_a, df_b]).sort_values(['letter', 'Count'], ascending =[True, False])
And since this is sorted by letter and count (descending) just get the first row of each group.
df_first = df_all.groupby('letter').first()
And to get the top 25%, just use
top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]
And then use .to_csv to output to file.
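For comparison, the two selections can also be sketched without pandas, using collections.Counter. This is an illustrative variant rather than the answer's method; it reads "top 25%" as the top quarter of distinct letters by frequency, which is an assumption, and it works on an in-memory list of lines instead of the file.

```python
from collections import Counter

lines = ["a;b", "b;a", "c;d", "d;c", "e;a", "f;g",
         "h;b", "b;f", "b;f", "c;g", "a;b", "d;f"]

# Treat each line as an unordered couple by sorting its two letters.
couples = [tuple(sorted(line.split(';'))) for line in lines]
couple_counts = Counter(couples)
letter_counts = Counter(letter for couple in couples for letter in couple)

# Most frequent couple for each letter (ties go to the couple seen first).
best_couple = {}
for couple, _ in couple_counts.most_common():
    for letter in couple:
        best_couple.setdefault(letter, couple)

# Output 1: lines whose couple is the most frequent one for some letter.
winners = set(best_couple.values())
lines_best = [line for line, couple in zip(lines, couples) if couple in winners]

# Output 2: lines containing any of the top 25% most frequent letters.
n_top = max(1, len(letter_counts) // 4)
top_letters = {letter for letter, _ in letter_counts.most_common(n_top)}
lines_with_top = [line for line in lines if top_letters & set(line.split(';'))]

print(best_couple)
print(sorted(top_letters))
```

Each resulting list can then be written out with "\n".join(...) to its own text file.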

Python: Storing multiple dictionaries after replacing categoricals with integers

My data looks like this:
source browser sex age country class
SEO Chrome M 39 Japan 0
Ads Chrome F 53 United States 0
SEO Opera M 53 United States 1
SEO Safari M 41 NULL 0
Ads Safari M 45 United States 0
Ads Chrome M 18 Canada 0
In trying to get it ready for machine learning, I wrote a function to replace categoricals with integers:
def str2int(data):
    y2= data
    S = set(y2) #set
    D = dict(zip(S, range(len(S)))) # assign each string an integer, and put it in a dict
    Y = [D[y2_] for y2_ in y2] # store class labels as ints
    return Y
I then call it using the below to convert all string columns to integers:
cols=['sex','browser','country','source']
for col in cols:
    df_fraud[col] = convert_str_int(df_fraud[col])
I would like to store the dictionary associated with each column and use it later. I could simply write "return Y, D" in the function, but I am not sure how I would handle the extra return value in my for loop.
Frankly, I am not sure what the best way to store these references in dictionaries is, and I am open to suggestions.
I have simplified the example below:
This is not working when using the suggested code. Any ideas?
def str2int(data):
    y2= data
    S = set(y2) #set
    D = dict( zip(S, range(len(S))) ) # assign each string an integer, and put it in a dict
    Y = [D[y2_] for y2_ in y2] # store class labels as ints
    return Y, D
def make_str2int(data):
    categories = set(data)
    return dict(zip(categories, range(len(categories))))
raw_data = {
    'names': ['A','B','B','D','D','E','B','B','E','F'],
    'gender': ['M','F','F','F','F','M','M','M','M','M']}
str2int={}
cols = ['names', 'gender']
for col in cols:
    str2int[col] = make_str2int(df_fraud[col])

I haven't tested this, and I'm not sure I understand exactly how you intend to use the dictionaries, but here are my suggestions.
You could store the dictionaries in a dictionary of dictionaries:
def make_str2int(data):
    categories = set(data)
    return dict(zip(categories, range(len(categories))))
str2int = {}
cols = ['sex', 'browser', 'country', 'source']
for col in cols:
    str2int[col] = make_str2int(df_fraud[col])
(This assumes df_fraud represents your table; you didn't make this clear in your question.)
And then, if you want the categories existing in one column col, you can call:
str2int[col].keys()
If you want the corresponding numbers:
str2int[col].values()
If you want the number associated with a categorical value cat_val in a known column col:
str2int[col][cat_val]
Edit: Applying on your raw_data example
def make_str2int(data):
    categories = set(data)
    return dict(zip(categories, range(len(categories))))
raw_data = {
    'names': ['A','B','B','D','D','E','B','B','E','F'],
    'gender': ['M','F','F','F','F','M','M','M','M','M']}
str2int={}
cols = raw_data.keys()
for col in cols:
    str2int[col] = make_str2int(raw_data[col])
print("Conversion examples:")
element = raw_data['names'][3]
print("%s -> %s" % (element, str2int['names'][element]))
element = raw_data['gender'][2]
print("%s -> %s" % (element, str2int['gender'][element]))
Output:
Conversion examples:
D -> 3
F -> 1
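Because set ordering is not guaranteed, the integers assigned above can differ between runs. A small variation, sketched here as an assumption about what is wanted rather than as part of the answer, sorts the categories first so the mapping is reproducible, and applies it to the columns in the same loop. The names mappings and encoded are hypothetical.

```python
import pandas as pd

raw_data = {
    'names': ['A', 'B', 'B', 'D', 'D', 'E', 'B', 'B', 'E', 'F'],
    'gender': ['M', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M']}
df = pd.DataFrame(raw_data)

mappings = {}          # one {category: integer} dict per column
encoded = df.copy()    # keep the original strings untouched
for col in df.columns:
    categories = sorted(df[col].unique())
    mappings[col] = {cat: i for i, cat in enumerate(categories)}
    encoded[col] = df[col].map(mappings[col])

print(mappings)
print(encoded.head())
```

To decode later, invert a column's dict, for example {i: cat for cat, i in mappings['names'].items()}.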
