Python: Storing multiple dictionaries after replacing categoricals with integers - python

My data looks like this:
source  browser  sex  age  country        class
SEO     Chrome   M    39   Japan          0
Ads     Chrome   F    53   United States  0
SEO     Opera    M    53   United States  1
SEO     Safari   M    41   NULL           0
Ads     Safari   M    45   United States  0
Ads     Chrome   M    18   Canada         0
In trying to get it ready for machine learning, I wrote a function to replace categoricals with integers:
def str2int(data):
    y2 = data
    S = set(y2)  # set
    D = dict(zip(S, range(len(S))))  # assign each string an integer, and put it in a dict
    Y = [D[y2_] for y2_ in y2]  # store class labels as ints
    return Y
I then call it using the below to convert all string columns to integers:
cols = ['sex', 'browser', 'country', 'source']
for col in cols:
    df_fraud[col] = str2int(df_fraud[col])
I would like to store the dictionary associated with each column so I can look it up later. I could simply write "return Y, D" in the function, but I am not sure how to handle the extra return value in my for loop above.
Frankly, I am not sure what the best way to store these lookup dictionaries is, and I am open to suggestions.
I have simplified the example below:
This is not working when using the suggested code. Any ideas?
def str2int(data):
    y2 = data
    S = set(y2)  # set
    D = dict(zip(S, range(len(S))))  # assign each string an integer, and put it in a dict
    Y = [D[y2_] for y2_ in y2]  # store class labels as ints
    return Y, D
def make_str2int(data):
    categories = set(data)
    return dict(zip(categories, range(len(categories))))
raw_data = {
    'names': ['A','B','B','D','D','E','B','B','E','F'],
    'gender': ['M','F','F','F','F','M','M','M','M','M']}
str2int = {}
cols = ['names', 'gender']
for col in cols:
    str2int[col] = make_str2int(df_fraud[col])

I haven't tested this, and I'm not sure I understand exactly how you intend to use the dictionaries, but here are my suggestions.
You could store the dictionaries in a dictionary of dictionaries:
def make_str2int(data):
    categories = set(data)
    return dict(zip(categories, range(len(categories))))
str2int = {}
cols = ['sex', 'browser', 'country', 'source']
for col in cols:
    str2int[col] = make_str2int(df_fraud[col])
(This assumes df_fraud represents your table; you didn't make this clear in your question.)
And then, if you want the categories existing in one column col, you can call:
str2int[col].keys()
If you want the corresponding numbers:
str2int[col].values()
If you want the number associated with a categorical value cat_val in a known column col:
str2int[col][cat_val]
Edit: Applying on your raw_data example
def make_str2int(data):
    categories = set(data)
    return dict(zip(categories, range(len(categories))))
raw_data = {
    'names': ['A','B','B','D','D','E','B','B','E','F'],
    'gender': ['M','F','F','F','F','M','M','M','M','M']}
str2int = {}
cols = raw_data.keys()
for col in cols:
    str2int[col] = make_str2int(raw_data[col])
print("Conversion examples:")
element = raw_data['names'][3]
print("%s -> %s" % (element, str2int['names'][element]))
element = raw_data['gender'][2]
print("%s -> %s" % (element, str2int['gender'][element]))
Output (the exact integers may differ between runs, since set ordering is arbitrary):
Conversion examples:
D -> 3
F -> 1
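To actually replace the values and get the original strings back later, the dictionary of dictionaries can be used in both directions. This is a minimal plain-Python sketch, not the answer's exact code: sorting the categories (so the codes are reproducible) and the int2str reverse lookup are assumptions added for illustration.

```python
def make_str2int(values):
    # Sorting is an added assumption: it makes the integer codes
    # reproducible, unlike plain set iteration order.
    categories = sorted(set(values))
    return {cat: code for code, cat in enumerate(categories)}

raw_data = {
    'names': ['A', 'B', 'B', 'D', 'D', 'E', 'B', 'B', 'E', 'F'],
    'gender': ['M', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M'],
}

# One mapping per column, stored under the column name.
str2int = {col: make_str2int(vals) for col, vals in raw_data.items()}

# Encode every column with its own mapping.
encoded = {col: [str2int[col][v] for v in vals] for col, vals in raw_data.items()}

# Hypothetical reverse lookup (int2str) to decode the integers again.
int2str = {col: {code: cat for cat, code in mapping.items()}
           for col, mapping in str2int.items()}
decoded = {col: [int2str[col][c] for c in codes] for col, codes in encoded.items()}

print(encoded['gender'])
print(decoded == raw_data)
```

With a pandas DataFrame, the same per-column dictionaries could be passed to Series.map instead of the list comprehensions.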

Related

How to select list elements based on criteria from other lists

I am new to Python, coming from SciLab (an open source MatLab ersatz), which I am using as a toolbox for my analyses (test data analysis, reliability, acoustics, ...); I am definitely not a computer science lad.
I have data in the form of lists of same length (vectors of same size in SciLab).
I use some of them as parameters in order to select data from another one; e.g.
t_v = [1:10]; // a parameter vector
p_v = [20:29]; // another parameter vector
res_v(t_v > 5 & p_v < 28); // the res_v elements whose "corresponding" p_v and t_v values comply with my criteria; I can use them for analyses.
This is very direct and simple in SciLab; I did not find the way to achieve the same with Python, either "Pythonically" or simply translated.
Any idea that could help me, please?
Have a nice day,
Patrick.
You could use numpy arrays. It's easy:
import numpy as np
par1 = np.array([1,1,5,5,5,1,1])
par2 = np.array([-1,1,1,-1,1,1,1])
data = np.array([1,2,3,4,5,6,7])
print(par1)
print(par2)
print(data)
bool_filter = (par1 > 1) & (par2 < 0)
# example to do it directly in the array
filtered_data = data[par1 > 1]
print(filtered_data)
# filtering with the two parameters
filtered_data_twice = data[bool_filter]
print(filtered_data_twice)
output:
[1 1 5 5 5 1 1]
[-1 1 1 -1 1 1 1]
[1 2 3 4 5 6 7]
[3 4 5]
[4]
Note that it does not keep the same number of elements.
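If you would rather stay with plain lists, the same element-wise selection can be sketched without numpy. The values of res_v are hypothetical here (the question does not give them); t_v and p_v follow the SciLab example.

```python
from itertools import compress

t_v = list(range(1, 11))     # 1..10, as in the SciLab example
p_v = list(range(20, 30))    # 20..29
res_v = list(range(30, 40))  # hypothetical result data

# Walk the three lists in parallel and keep the res_v element
# whenever the matching t_v and p_v values satisfy the criteria.
selected = [r for t, p, r in zip(t_v, p_v, res_v) if t > 5 and p < 28]

# Alternative: build a boolean mask first, then apply it.
mask = [t > 5 and p < 28 for t, p in zip(t_v, p_v)]
selected_again = list(compress(res_v, mask))

print(selected)
```

The mask version is closer to the SciLab idiom, since the same mask can be reused to filter several lists.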
Here's my modified solution according to your last comment.
t_v = list(range(1,10))
p_v = list(range(20,29))
res_v = list(range(30,39))
def first_idex_greater_than(search_number, lst):
    for count, number in enumerate(lst):
        if number > search_number:
            return count
def first_idex_lower_than(search_number, lst):
    for count, number in enumerate(lst[::-1]):
        if number < search_number:
            return len(lst) - count  # since I searched lst from top to bottom,
                                     # I need to also reverse count
t_v_index = first_idex_greater_than(5, t_v)
p_v_index = first_idex_lower_than(28, p_v)
print(res_v[min(t_v_index, p_v_index):max(t_v_index, p_v_index)])
It prints the list [35, 36, 37].
I'm sure you can optimize it better according to your needs.
The problem statement is not clearly defined, but this is what I interpret to be a likely solution.
import pandas as pd
tv = list(range(1, 11))
pv = list(range(20, 30))
res = list(range(30, 40))
df = pd.DataFrame({'tv': tv, 'pv': pv, 'res': res})
print(df)
def criteria(row, col1, a, col2, b):
    if (row[col1] > a) & (row[col2] < b):
        return True
    else:
        return False
df['select'] = df.apply(lambda row: criteria(row, 'tv', 5, 'pv', 28), axis=1)
selected_res = df.loc[df['select']]['res'].tolist()
print(selected_res)
# ... or another way ..
print(df.loc[(df.tv > 5) & (df.pv < 28)]['res'])
This builds a dataframe in which each column is one of the original lists. The selection criterion, based on columns tv and pv, is applied row by row, and a new boolean column records whether each row satisfies it. The last two print calls output:
[35, 36, 37]
5 35
6 36
7 37

Rectangular array with holes

I'm trying to create a rectangular grid with numbers in some cells (but not all of them), in a way such that it's easy to select a given row or column.
What I did so far is to create the list of the positions of the numbers in the grid and the list of the numbers contained in the grid, so that I can select the number at position (i,j) with numbers[positions.index([i,j])], but this is not very handy, especially if I need, for example, to find the minimum of the values in a given column.
Is there a way to create the grid so that, for example, I can select elements with grid[i][j] and columns with grid[:][j] or something similar? The programming language is Python.
You can use numpy for this. It lets you create an array, which can index a single value with array[i,j] or a full column with array[:,j].
I'm not entirely sure what you mean by holes, but numpy will require you to have a value in every spot in the array. The best option, I believe, is to set the unused cells to a preset "empty" value.
Store your grid as a 2D array (a matrix) and use list comprehensions.
first_column = [row[0] for row in grid]
second_column = [row[1] for row in grid]
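A minimal sketch of how this could handle the holes, assuming None marks an empty cell (the grid values and the helper name column are made up for illustration):

```python
# None marks a hole; the values are hypothetical.
grid = [
    [None, 4, None],
    [7, None, 2],
    [None, 1, 9],
]

def column(grid, j):
    # Collect column j, skipping the holes.
    return [row[j] for row in grid if row[j] is not None]

print(grid[1][0])            # single cell
print(column(grid, 1))       # values present in column 1
print(min(column(grid, 1)))  # minimum of column 1, ignoring holes
```

Rows are just grid[i], so only columns need a helper.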
If a large proportion of the "cells" are going to be unused, you could try using a dictionary keyed by (row, column) coordinate tuples.
matrix = dict()
matrix[1,3] = 13
matrix[1,5] = 15
matrix[2,3] = 23
matrix[2,7] = 27
matrix[3,7] = 37
valuesInRow2 = [v for (r,c),v in matrix.items() if r==2]
# [23,27]
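The same idea covers the column minimum asked about in the question. A small sketch, where column_values is a hypothetical helper name and dict.get is used to treat missing cells as holes:

```python
matrix = {
    (1, 3): 13,
    (1, 5): 15,
    (2, 3): 23,
    (2, 7): 27,
    (3, 7): 37,
}

def column_values(matrix, col):
    # Every stored value whose column coordinate equals col.
    return [v for (r, c), v in matrix.items() if c == col]

print(min(column_values(matrix, 7)))
# A cell that was never set is simply absent; get() returns None for it.
print(matrix.get((0, 0)))
# default=None avoids a ValueError when a column has no values at all.
print(min(column_values(matrix, 4), default=None))
```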
By creating a subclass of dict to manage indexing and overriding operators, you could get it to behave exactly the way you want:
class Sparse(dict):
    def __init__(self,rows=0,cols=0):
        super().__init__()
        self.rows = rows
        self.cols = cols
    def __indexToRanges(self,rowIndex,colIndex):
        # convert an int or slice index into a range of rows / columns
        scalar = isinstance(rowIndex,int) and isinstance(colIndex,int)
        if isinstance(rowIndex,slice):
            rowRange = range(*rowIndex.indices(self.rows))
        else:
            rowRange = range(rowIndex,rowIndex+1)
        if isinstance(colIndex,slice):
            colRange = range(*colIndex.indices(self.cols))
        else:
            colRange = range(colIndex,colIndex+1)
        return rowRange,colRange,scalar
    def __getitem__(self,indexes):
        row,col = indexes
        rowRange,colRange,scalar = self.__indexToRanges(row,col)
        if scalar: return super().__getitem__((row,col))
        return [v for (r,c),v in self.items() if r in rowRange and c in colRange]
    def __setitem__(self,index,value):
        row,col=index
        rowRange,colRange,scalar = self.__indexToRanges(row,col)
        if scalar:
            # grow the known dimensions so that slices cover the new cell
            self.rows = max(self.rows,row+1)
            self.cols = max(self.cols,col+1)
            return super().__setitem__((row,col),value)
usage:
matrix = Sparse()
matrix[1,3] = 13
matrix[1,5] = 15
matrix[2,3] = 23
matrix[2,7] = 27
matrix[3,7] = 37
print("sum of column 3:", sum(matrix[:,3]) ) # 36
print("sum of row 2:", sum(matrix[2,:]) ) # 50
print("top left 4x4 values:", matrix[:4,:4] ) # [13, 23]

Generate many to many list of dictionary relation from python data frame

import pandas as pd
import os
list_of_dict2 = [[{'1580674': ['HA-567034786', 'AB-1018724']}], [{'1554970': ['AB-6348403', 'HA-7298656']}, {'1554970': ['AB-2060953', 'HA-990228']}, {'1554970': ['HA-7287204', 'AB-1092380','GR-33333']}]]
list_of_dict = []
for i in list_of_dict2:
    for j in i:
        list_of_dict.append(list(j.values())[0])
df = pd.DataFrame(list_of_dict)
print(df)
My Current dataframe result
              0           1         2
0  HA-567034786  AB-1018724      None
1    AB-6348403  HA-7298656      None
2    AB-2060953   HA-990228      None
3    HA-7287204  AB-1092380  GR-33333
Using the list of dictionaries, I can generate the data frame with the code above. My problem is turning it into a many-to-many list of dictionaries. Let me explain what I want to achieve.
For every row of the data frame, I want to build a many-to-many dictionary in which each value maps to a list of the other values in the same row. For the rows at index 2 and 3, for example, I want the output below.
Expected Output:(for 2nd index)
{
"AB-2060953" : ['HA-990228'],
"HA-990228" : ['AB-2060953']
}
Expected Output:(for 3rd index)
{
"HA-7287204" : ['AB-1092380','GR-33333'],
"AB-1092380" : ['HA-7287204','GR-33333'],
"GR-33333" : ['AB-1092380','HA-7287204']
}
One approach could be the following:
def make_dict(row):
    s = set(row[~row.isna()])
    return {x: list(s - {x}) for x in s}
df.apply(make_dict, axis=1)
# Output:
0 {'AB-1018724': ['HA-567034786'], 'HA-567034786': ['AB-1018724']}
1 {'AB-6348403': ['HA-7298656'], 'HA-7298656': ['AB-6348403']}
2 {'HA-990228': ['AB-2060953'], 'AB-2060953': ['HA-990228']}
3 {'GR-33333': ['AB-1092380', 'HA-7287204'], 'AB-1092380': ['GR-33333', 'HA-7287204'], 'HA-7287204': ['GR-33333', 'AB-1092380']}
dtype: object
Or, without assuming uniqueness and dealing with sets,
df.apply(lambda row: {x: [y for y in row if y and x != y] for x in row if x}, axis=1)
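If the data frame is only an intermediate step, the same mapping could be built straight from the nested input. This is a sketch under that assumption; many_to_many is a hypothetical helper name, and unlike the set-based version it keeps the values in their original order.

```python
list_of_dict2 = [
    [{'1580674': ['HA-567034786', 'AB-1018724']}],
    [{'1554970': ['AB-6348403', 'HA-7298656']},
     {'1554970': ['AB-2060953', 'HA-990228']},
     {'1554970': ['HA-7287204', 'AB-1092380', 'GR-33333']}],
]

def many_to_many(values):
    # Map each value to all the other values of the same group,
    # preserving the original order and skipping missing entries.
    items = [v for v in values if v is not None]
    return {x: [y for y in items if y != x] for x in items}

# One group per inner dictionary; each dictionary holds a single list.
groups = [list(d.values())[0] for sub in list_of_dict2 for d in sub]
result = [many_to_many(g) for g in groups]

print(result[3])
```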

Feature extraction from the training data

I have training data like the sample below, with all the information in a single column. The data set has more than 300000 rows.
id      features                                                         label
1       name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;    1
2       name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games;   1
3       name=David;age=12;1:=High School;2:=Cricketer;native=america;    2
4       name=George;age=11;1:=High School;2:=Carpenter;married=yes       2
.
.
300000  name=Kevin;age=16;1:=High School;2:=Driver;Smoker=No             3
Now I need to convert this training data into the form below:
id  name          age  1               2                Interest     married  Smoker
1   John Matthew  25   Post Graduate   Football Player  Nan          Nan      Nan
2   Mark clark    21   Under Graduate  Nan              Video Games  Nan      Nan
.
.
Is there an efficient way to do this? I tried the code below, but it took 3 hours to complete.
#Getting the proper features from the features column
cols = {}
for choices in set_label:
    collection_list = []
    array = train["features"][train["label"] == choices].values
    for i in range(1,len(array)):
        var_split = array[i].split(";")
        try :
            d = (dict(s.split('=') for s in var_split))
            for x in d.keys():
                collection_list.append(x)
        except ValueError:
            Error = ValueError
    count = Counter(collection_list)
    for k , v in count.most_common(5):
        key = k.replace(":","").replace(" ","_").lower()
        cols[key] = v
columns_add = list(cols.keys())
train = train.reindex(columns = np.append( train.columns.values, columns_add))
print (train.columns)
print (train.shape)
#Adding the values for the newly created problem
for row in train.itertuples():
    dummy_dic = {}
    new_dict={}
    value = train.loc[row.Index, 'features']
    v_split = value.split(";")
    try :
        dummy_dict = (dict(s.split('=') for s in v_split))
        for k, v in dummy_dict.items():
            new_key = k.replace(":","").replace(" ","_").lower()
            new_dict[new_key] = v
    except ValueError:
        Error = ValueError
    for k,v in new_dict.items():
        if k in train.columns:
            train.loc[row.Index, k] = v
Is there any useful function that I can apply here for efficient feature extraction?
Create two DataFrames that meet your criteria. In the first one, every data point has the same features; the second one is a modification of the first that introduces different features for some data points:
import pandas as pd
import numpy as np
import random
import time
import itertools
# Create a DataFrame where all the keys for each datapoint in the "features" column are the same.
num = 300000
NAMES = ['John', 'Mark', 'David', 'George', 'Kevin']
AGES = [25, 21, 12, 11, 16]
FEATURES1 = ['Post Graduate', 'Under Graduate', 'High School']
FEATURES2 = ['Football Player', 'Cricketer', 'Carpenter', 'Driver']
LABELS = [1, 2, 3]
df = pd.DataFrame({
    'features': ["name={0};age={1};feature1={2};feature2={3}".format(
                     NAMES[np.random.randint(0, len(NAMES))],
                     AGES[np.random.randint(0, len(AGES))],
                     FEATURES1[np.random.randint(0, len(FEATURES1))],
                     FEATURES2[np.random.randint(0, len(FEATURES2))]) for i in range(num)],
    'label': [LABELS[np.random.randint(0, len(LABELS))] for i in range(num)]})
print(df.head(20))
# Create a modified sample DataFrame from the previous one, where not all the keys are the same for each data point.
mod_df = df.copy()  # copy, so that df itself keeps the uniform features
random_positions1 = random.sample(range(10), 5)
random_positions2 = random.sample(range(11, 20), 5)
INTERESTS = ['Basketball', 'Golf', 'Rugby']
SMOKING = ['Yes', 'No']
mod_df.loc[random_positions1, 'features'] = ["name={0};age={1};interest={2}".format(
    NAMES[np.random.randint(0, len(NAMES))],
    AGES[np.random.randint(0, len(AGES))],
    INTERESTS[np.random.randint(0, len(INTERESTS))]) for i in range(len(random_positions1))]
mod_df.loc[random_positions2, 'features'] = ["name={0};age={1};smoking={2}".format(
    NAMES[np.random.randint(0, len(NAMES))],
    AGES[np.random.randint(0, len(AGES))],
    SMOKING[np.random.randint(0, len(SMOKING))]) for i in range(len(random_positions2))]
print(mod_df.head(20))
Assume that your original data is stored in a DataFrame called df.
Solution 1 (all the features are the same for every data point).
def func2(y):
    lista = y.split('=')
    value = lista[1]
    return value
def function(x):
    lista = x.split(';')
    array = [func2(i) for i in lista]
    return array
# Calculate the execution time
start = time.time()
array = pd.Series(df.features.apply(function)).tolist()
new_df = pd.DataFrame.from_records(array, columns=['name', 'age', '1', '2'])
end = time.time()
new_df
print('Total time:', end - start)
Total time: 1.80923295021
Edit: The one thing you need to do is adjust the columns list to match your own feature names.
Solution 2 (The features might be the same or different for every data point).
import pandas as pd
import numpy as np
import time
import itertools
# The following functions are meant to extract the keys from each row, which are going to be used as columns.
def extract_key(x):
    return x.split('=')[0]
def def_columns(x):
    lista = x.split(';')
    keys = [extract_key(i) for i in lista]
    return keys
df = mod_df
columns = pd.Series(df.features.apply(def_columns)).tolist()
flattened_columns = list(itertools.chain(*columns))
flattened_columns = np.unique(np.array(flattened_columns)).tolist()
flattened_columns
# This function turns each row from the original dataframe into a dictionary.
def function(x):
    lista = x.split(';')
    dict_ = {}
    for i in lista:
        key, val = i.split('=')
        dict_[key] = val
    return dict_
df.features.apply(function)
arr = pd.Series(df.features.apply(function)).tolist()
pd.DataFrame.from_dict(arr)
Suppose your data is like this :
features= ["name=John Matthew;age=25;1:=Post Graduate;2:=Football Player;",
'name=Mark clark;age=21;1:=Under Graduate;2:=Football Player;',
"name=David;age=12;1:=High School;2:=Cricketer;",
"name=George;age=11;1:=High School;2:=Carpenter;",
'name=Kevin;age=16;1:=High School;2:=Driver; ']
df = pd.DataFrame({'features': features})
I will start from this data and replace every key prefix (name=, ;age=, ;1:=, ;2:=) with ; using this function:
def replace_feature(x):
    for r in (("name=", ";"), (";age=", ";"), (';1:=', ';'), (';2:=', ";")):
        x = x.replace(*r)
    x = x.split(';')
    return x
df = df.assign(features= df.features.apply(replace_feature))
After applying that function to your df, every value will be a list of features, from which you can get each one by index.
Then I use four custom functions to get each attribute: name, age, grade, and job.
Note: There may be a better way to do this using only one function.
def get_name(df):
    return df['features'][1]
def get_age(df):
    return df['features'][2]
def get_grade(df):
    return df['features'][3]
def get_job(df):
    return df['features'][4]
And finally, apply those functions to your dataframe:
df = df.assign(name = df.apply(get_name, axis=1),
               age = df.apply(get_age, axis=1),
               grade = df.apply(get_grade, axis=1),
               job = df.apply(get_job, axis=1))
I hope this runs quickly enough for your data.
As far as I understand your code, the poor performance comes from the fact that you create the dataframe element by element. It's better to create the whole dataframe at once from a list of dictionaries.
Let's recreate your input dataframe:
import pandas as pd
from io import StringIO
# columns are separated by at least three spaces, which is what the sep regex below expects
data=StringIO("""id   features   label
1   name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;   1
2   name=Mark clark;age=21;1.=Under Graduate;2.=Football Player;   1
3   name=David;age=12;1:=High School;2:=Cricketer;   2
4   name=George;age=11;1:=High School;2:=Carpenter;   2""")
df=pd.read_table(data,sep=r'\s{3,}',engine='python')
We can check:
print(df)
   id                                           features  label
0   1  name=John Matthew;age=25;1.=Post Graduate;2.=F...      1
1   2  name=Mark clark;age=21;1.=Under Graduate;2.=Fo...      1
2   3     name=David;age=12;1:=High School;2:=Cricketer;      2
3   4    name=George;age=11;1:=High School;2:=Carpenter;      2
Now we can create the needed list of dictionaries with the following code:
feat=[]
for line in df['features']:
    line=line.replace(':','.')
    lsp=line.split(';')[:-1]
    feat.append(dict([elt.split('=') for elt in lsp]))
And the resulting dataframe (column order may vary with the pandas version):
print(pd.DataFrame(feat))
               1.               2. age          name
0   Post Graduate  Football Player  25  John Matthew
1  Under Graduate  Football Player  21    Mark clark
2     High School        Cricketer  12         David
3     High School        Carpenter  11        George
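The rows in the question do not all share the same keys, and some pieces have no "=" at all, so a slightly more defensive parser may help. The following is only a sketch under assumptions: parse_features is a hypothetical helper, and the key clean-up (dropping a trailing ":" or ".", lower-casing, replacing spaces) is modeled on the normalization in the question's own code.

```python
def parse_features(text):
    # Turn "k1=v1;k2=v2;..." into a dict, skipping pieces without '='.
    record = {}
    for piece in text.split(';'):
        key, sep, value = piece.partition('=')
        if not sep or not key.strip():
            continue  # empty trailing piece or malformed entry
        # Assumed normalization: "1." and "1:" both become "1".
        key = key.strip().rstrip(':.').replace(' ', '_').lower()
        record[key] = value.strip()
    return record

rows = [
    "name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;",
    "name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games;",
    "name=David;age=12;1:=High School;2:=Cricketer;native=america;",
]

records = [parse_features(r) for r in rows]

# Union of all keys in first-seen order; missing values become None.
columns = list(dict.fromkeys(k for rec in records for k in rec))
table = [[rec.get(c) for c in columns] for rec in records]

print(columns)
for line in table:
    print(line)
```

Passing records to pd.DataFrame in a single call would then build the whole table at once, which is the point made above about avoiding element-by-element assignment.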

How to do fuzzy matching on an Excel file using Pandas?

I have a table called account with two columns - ID & NAME. ID is a hash which is unique but NAME is a string which might have duplicates.
I'm trying to write a python script to read this excel file and match 0-3 similar NAME values, but I just cannot seem to get it to work.
Could someone help out? Thanks
import pandas as pd
from fuzzywuzzy import fuzz
import difflib
def get_spr(row):
    d = name1.apply(lambda x: (fuzz.ratio(x['NAME'], row['NAME']) * 0 if row['ID'] == x['ID'] else 1), axis=1)
    d = d[d>= 60]
    if len(d) == 0:
        v = ['']*2
    else:
        v = name1.ix[d.idxmax(),['ID' , 'NAME']].values
    return pd.Series(v, index=['ID', 'NAME'])
def score(tablerow):
    d = name1.apply(lambda x: fuzz.ratio(x['NAME'],tablerow['NAME']) * (0 if x['ID']==tablerow['ID'] else 1), axis=1)
    d = d[d>90]
    if len(d) == 0:
        v = [''] * 2
    else:
        v = name1.ix[d.order(ascending=False).head(3).index, ['ID' , 'NAME']].values
    return pd.DataFrame(v, index=['ID', 'NAME'])
account = "account_test.xlsx"
xl_acc1 = pd.ExcelFile(account)
xl_acc2 = pd.ExcelFile(account)
acc1 = xl_acc1.parse(xl_acc1.sheet_names[0])
acc2 = xl_acc2.parse(xl_acc2.sheet_names[0])
name1 = acc1[pd.notnull(acc1['NAME'])]
name2 = acc2[pd.notnull(acc2['NAME'])]
print 'Doing Fuzzy Matching'
name2= pd.concat((name2,name2.apply(get_spr, axis=1)), axis=1)
name2.to_excel(pd.ExcelWriter('res.xlsx'),'acc')
Any help would be much appreciated!
The file has rows like this:-
ID NAME
0016F00001c7GDZQA2 Daniela Abriani
0016F00001c7GPnQAM Daniel Abriani
0016F00001c7JRrQAM Nisha Well
0016F00001c7Jv8QAE Katherine
0016F00001c7cXiQAI Katerine
0016F00001c7dA3QAI Katherin
0016F00001c7kHyQAI Nursing and Midwifery Council Research Office
0016F00001c8G8OQAU Nisa Well
Expected (output dataframe) would be something like:
ID NAME ID2 NAME2
<hash1> katherine <hash2> katerine
<hash1> katherine <hash3> katherin
<hash4> Nisa Well <hash5> Nisha Well
Issue: The above code just reproduces the input as the output saved file without actually concatenating any matches.
I don't think you need to do this in pandas. Here is my sloppy solution, but it gets your desired output using a dictionary.
import pandas as pd
from fuzzywuzzy import process
df = pd.DataFrame([
    ['0016F00001c7GDZQA2', 'Daniela Abriani'],
    ['0016F00001c7GPnQAM', 'Daniel Abriani'],
    ['0016F00001c7JRrQAM', 'Nisha Well'],
    ['0016F00001c7Jv8QAE', 'Katherine'],
    ['0016F00001c7cXiQAI', 'Katerine'],
    ['0016F00001c7dA3QAI', 'Katherin'],
    ['0016F00001c7kHyQAI', 'Nursing and Midwifery Council Research Office'],
    ['0016F00001c8G8OQAU', 'Nisa Well']],
    columns=['ID', 'NAME'])
Get the unique hashes into a dictionary.
hashdict = dict(zip(df['ID'], df['NAME']))
Define a function checkpair. You'll need it to remove reciprocal hash pairs. This method will add (hash1, hash2) and (hash2, hash1), but I think you only want to keep one of those pairs:
def checkpair(a, b, l):
    for x in l:
        if (a, b) == (x[2], x[0]):
            l.remove(x)
Now iterate through hashdict.items(), finding the top 3 matches along the way. The fuzzywuzzy docs detail the process module.
matches = []
for k, v in hashdict.items():
    # see docs for extract -- 4 because you are comparing a name to itself
    top3 = process.extract(v, hashdict, limit=4)
    # remove the hashID compared to itself
    for h in top3:
        if k == h[2]:
            top3.remove(h)
    # append tuples to the list "matches" if it meets a score criteria
    [matches.append((k, v, x[2], x[0], x[1])) for x in top3 if x[1] > 60]  # change score?
# remove reciprocal pairs
[checkpair(m[0], m[2], matches) for m in matches]
df = pd.DataFrame(matches, columns=['id1', 'name1', 'id2', 'name2', 'score'])
# write to file (the context manager saves and closes the workbook)
with pd.ExcelWriter('/path/to/your/file.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
Output:
                  id1           name1                 id2            name2  score
0  0016F00001c7JRrQAM      Nisha Well  0016F00001c8G8OQAU        Nisa Well     95
1  0016F00001c7GPnQAM  Daniel Abriani  0016F00001c7GDZQA2  Daniela Abriani     97
2  0016F00001c7Jv8QAE       Katherine  0016F00001c7dA3QAI         Katherin     94
3  0016F00001c7Jv8QAE       Katherine  0016F00001c7cXiQAI         Katerine     94
4  0016F00001c7dA3QAI        Katherin  0016F00001c7cXiQAI         Katerine     88
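If installing fuzzywuzzy is not an option, a similar pairing can be sketched with the standard library. Note that this swaps in difflib.SequenceMatcher, whose ratio is on a 0 to 1 scale and is not the same score as fuzz.ratio; the 0.85 threshold is an arbitrary assumption. Iterating over itertools.combinations visits each unordered pair once, so no reciprocal pairs need to be removed afterwards.

```python
from difflib import SequenceMatcher
from itertools import combinations

accounts = [
    ('0016F00001c7GDZQA2', 'Daniela Abriani'),
    ('0016F00001c7GPnQAM', 'Daniel Abriani'),
    ('0016F00001c7JRrQAM', 'Nisha Well'),
    ('0016F00001c7Jv8QAE', 'Katherine'),
    ('0016F00001c7cXiQAI', 'Katerine'),
    ('0016F00001c7dA3QAI', 'Katherin'),
    ('0016F00001c7kHyQAI', 'Nursing and Midwifery Council Research Office'),
    ('0016F00001c8G8OQAU', 'Nisa Well'),
]

THRESHOLD = 0.85  # assumed cut-off; tune it for your data

def similarity(a, b):
    # Case-insensitive similarity between 0.0 and 1.0.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

matches = []
# combinations() yields each unordered pair exactly once.
for (id1, name1), (id2, name2) in combinations(accounts, 2):
    score = similarity(name1, name2)
    if score >= THRESHOLD:
        matches.append((id1, name1, id2, name2, round(score, 3)))

for match in matches:
    print(match)
```

Comparing every pair is quadratic in the number of rows, so for a large sheet some form of pre-grouping (for example by first letter) would probably be needed.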
