Multiprocessing fuzzy wuzzy string search - python

I am trying to do a string match and bring in the match ID using fuzzywuzzy in Python. My dataset is huge: dataset1 = 1.8 million records, dataset2 = 1.6 million records.
What I have tried so far:
First I tried the recordlinkage package in Python. Unfortunately it ran out of memory while building the multi-index, so I moved to a more powerful AWS machine and built it successfully. However, when I tried to run the comparison it ran forever, which I accept is due to the number of comparisons.
Then I tried to do the string match with fuzzywuzzy and parallelise the process using the dask package, and executed it on sample data. It works fine, but I know the process will still take time because the search space is wide. I am looking for a way to add blocking or indexing to this piece of code.
import pandas as pd

test = pd.DataFrame({'Address1':['123 Cheese Way','234 Cookie Place','345 Pizza Drive','456 Pretzel Junction'],'city':['X','U','X','U']})
test2 = pd.DataFrame({'Address1':['123 chese wy','234 kookie Pl','345 Pizzza DR','456 Pretzel Junktion'],'city':['X','U','Z','Y'], 'ID': ['1','3','4','8']})
Here, I am trying to look for test.Address1 in test2.Address1 and bring its ID.
import dask
import dask.multiprocessing
import dask.dataframe as dd
from fuzzywuzzy import fuzz

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    slave_df['score'] = slave_df.Address1.apply(lambda x: fuzzy_score(x, orig_string))
    # return the ID corresponding to the highest score
    return slave_df.loc[slave_df.score.idxmax(), 'ID']

dmaster = dd.from_pandas(test, npartitions=24)
dmaster = dmaster.assign(ID_there=dmaster.Address1.apply(lambda x: helper(x, test2)))
dmaster.compute(get=dask.multiprocessing.get)
This works fine, however I am not sure how I can apply indexing on it to limit the search space to the same city.
Let's say I create an index on the city field, subset based on the city of the original string, and pass that subset to the helper function:
# sort the dataframe
test2.sort_values(by=['city'], inplace=True)
# set the index to be this and don't drop
test2.set_index(keys=['city'], drop=False,inplace=True)
I don't know how to do that. Please advise. Thanks in advance.
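Here is a minimal sketch of what I have in mind, shown in plain pandas for clarity (helper_by_city is just a placeholder name I made up); I am not sure this is the right way to do it:
def helper_by_city(orig_string, orig_city, slave_df):
    # slave_df is assumed to be indexed by 'city', as in the set_index call above
    if orig_city in slave_df.index:
        candidates = slave_df.loc[[orig_city]].reset_index(drop=True)
    else:
        candidates = slave_df.reset_index(drop=True)   # fall back to the full table
    return helper(orig_string, candidates)

test['ID_there'] = test.apply(
    lambda row: helper_by_city(row['Address1'], row['city'], test2), axis=1)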

I prefer using fuzzywuzzy.process.extractOne. That compares a string to an iterable of strings.
from fuzzywuzzy import process

def extract_one(col, other):
    # need this for dask later
    other = other.compute() if hasattr(other, 'compute') else other
    return pd.DataFrame([process.extractOne(x, other) for x in col],
                        columns=['Address1', 'score', 'idx'],
                        index=col.index)
extract_one(test.Address1, test2.Address1)
Address1 score idx
0 123 chese wy 92 0
1 234 kookie Pl 83 1
2 345 Pizzza DR 86 2
3 456 Pretzel Junktion 95 3
The idx is the index of the other passed to extract_one that matches closest. I would recommend having a meaningful index, to make joining the results later on easier.
For your second question, about filtering to cities, I would use a groupby and apply:
gr1 = test.groupby('city')
gr2 = test2.groupby('city')

gr1.apply(lambda x: extract_one(x.Address1,
                                gr2.get_group(x.name).Address1))
Address1 score idx
0 123 chese wy 92 0
1 234 kookie Pl 83 1
2 345 Pizzza DR 86 2
3 456 Pretzel Junktion 95 3
The only difference with dask is the need to specify a meta to the apply:
ddf1 = dd.from_pandas(test, 2)
ddf2 = dd.from_pandas(test2, 2)
dgr1 = ddf1.groupby('city')
dgr2 = ddf2.groupby('city')
meta = pd.DataFrame(columns=['Address1', 'score', 'idx'])
dgr1.apply(lambda x: extract_one(x.Address1,
                                 dgr2.get_group(x.name).Address1),
           meta=meta).compute()
Address1 score idx
city
U 0 234 kookie Pl 83 1
1 234 kookie Pl 28 1
X 0 123 chese wy 92 0
1 123 chese wy 28 0
Here's a notebook: https://gist.github.com/a932b3591346b898d6816a5efc2bc5ad
I'm curious to hear how the performance is. I'm assuming the actual string comparison done in fuzzywuzzy will take the bulk of the time, but I'd love to hear back on how much overhead is spent in pandas and dask. Make sure you have the C extensions for computing the Levenshtein distance.
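As a quick sanity check (a minimal sketch; I'm assuming the usual python-Levenshtein package is what provides the fast path):
# pip install python-Levenshtein   # provides the C-backed Levenshtein routines
try:
    import Levenshtein  # noqa: F401
    print("python-Levenshtein found: fuzzywuzzy can use the fast C path")
except ImportError:
    print("python-Levenshtein missing: fuzzywuzzy falls back to pure-Python matching")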

I ran into the same problem once. The whole process takes forever, and even if you use multiprocessing it is not really going to be very fast. The main cause of the slow speed is the fuzzy matching itself, because the processing is very tedious and requires a lot of time.
Alternatively, and more efficiently in my opinion, you could use an embedding, i.e. bag of words, and apply an ML method on it. The fact that you work with numerical vectors makes the whole process way faster!
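For illustration, a rough sketch of that idea (my own addition, assuming scikit-learn is available; the original answer did not include code), using character n-gram TF-IDF vectors and a nearest-neighbour lookup:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# vectorize addresses as character n-grams, then find the nearest candidate
# in test2 for each row of test
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
candidate_vecs = vectorizer.fit_transform(test2['Address1'])
query_vecs = vectorizer.transform(test['Address1'])

nn = NearestNeighbors(n_neighbors=1, metric='cosine').fit(candidate_vecs)
distances, indices = nn.kneighbors(query_vecs)

test['ID_there'] = test2['ID'].values[indices.ravel()]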

Related

2-way lookup in SAS

Let's say I have a (9000x9000) table like the following:
zone 304 305 306 307 308 ...
001 1 2 8 9 12 ...
002 6 8 3 7 1 ...
003 4 8 1 12 9 ...
004 2 7 3 16 34 ...
...
The main data table looks like this:
package # weight origin destination zone
123 2oz 004 305 7 to be inputted here
.
.
.
I need SAS to output the "zone" corresponding to a given ordered pair. I fear the only way would be with some type of loop. For instance, in the example above, the origin value comes from the row labels and the destination from the column labels. The intersection is the target value I need assigned to "zone".
A solution using python data wrangling libraries would work also.
Also, the 9000x9000 table is an Excel CSV file.
You could use pandas; it has a built-in function to read from an Excel document: pandas.read_excel()
So for this file:
import pandas as pd
df = pd.read_excel('test.xlsx')
print(df[101][502])
Output:
67
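Applied to the zone table in the question (a hedged sketch; the file name and the dtype/index choices are my assumptions), you can keep the origin codes as the row index and look up an origin/destination pair with .loc:
import pandas as pd

# 'zones.csv' is a hypothetical name for the 9000x9000 table;
# dtype=str keeps codes like '004' from losing their leading zeros
zones = pd.read_csv('zones.csv', index_col=0, dtype=str)

# intersection of origin '004' and destination '305' (7 in the example table)
print(zones.loc['004', '305'])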
My approach:
Load the data set into a temporary array (9000x9000) and then look up each element as needed. It could be memory intensive, but 9000*9000 seems small enough to me.
Another safe approach: transpose the data into a long format:
Key1 Key2 Value
001 304 1
001 305 2
...
Then, in any language, it becomes a join/merge instead of a lookup.
You can also use PROC IML, which loads the data as a matrix and then you can access using the indexes.
There are also ways in SAS to do this lookup via a merge, primarily using VVALUEX.
Without knowing how you're going to use it, I can't provide any more information.
EDIT: added a third option, which is IML. Basically there are many ways to do this; the best depends on how you're planning to use it overall.
EDIT2:
1. Import the first data set into SAS (PROC IMPORT)
2. Transpose it using PROC TRANSPOSE
3. Merge with either a data step or PROC SQL, by ORIGIN and DESTINATION, which will be straightforward. At this point it's really a standard lookup with two keys (a pandas equivalent is sketched below, since a Python solution was said to work as well).
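Since a Python data-wrangling solution was said to work as well, here is a hedged pandas sketch of the transpose-then-merge idea (file and column names are assumptions):
import pandas as pd

# wide zone table: origin codes down the 'zone' column, destination codes across the columns
zones = pd.read_csv('zones.csv', dtype=str)              # hypothetical file name
long_zones = (zones.melt(id_vars='zone', var_name='destination', value_name='zone_value')
                   .rename(columns={'zone': 'origin'}))

# main package table with 'origin' and 'destination' columns
packages = pd.read_csv('packages.csv', dtype=str)        # hypothetical file name
packages = packages.merge(long_zones, on=['origin', 'destination'], how='left')
# 'zone_value' now holds the looked-up zone for each package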

Speeding up cross-reference filtering in Pandas DB

I am working with a very large donation database with relevant columns for donation ID, conduit ID, and amount, for example:
TRANSACTION_ID BACK_REFERENCE_TRAN_ID_NUMBER CONTRIBUTION_AMOUNT
0 VR0P4H2SEZ1 0 100
1 VR0P4H3X770 0 2700
2 VR0P4GY6QV1 0 500
3 VR0P4H3X720 0 1700
4 VR0P4GYHHA0 VR0P4GYHHA0E 200
What I need to do is to identify all of the rows where the TRANSACTION_ID corresponds to any BACK_REFERENCE_TRAN_ID_NUMBER. My current code, albeit a little clumsy, is:
is_from_conduit = df[df.BACK_REFERENCE_TRAN_ID_NUMBER != "0"].BACK_REFERENCE_TRAN_ID_NUMBER.tolist()
df['CONDUIT_FOR_OTHER_DONATION'] = 0
for row in df.index:
    if df['TRANSACTION_ID'][row] in is_from_conduit:
        df['CONDUIT_FOR_OTHER_DONATION'][row] = 1
    else:
        df['CONDUIT_FOR_OTHER_DONATION'][row] = 0
However, on very large data sets with a large number of conduit donations, this takes forever. I know there must be a simpler way, but I can't come up with how to phrase this to find out what that may be.
You can use Series.isin. It is a vectorized operation that checks if each element of the Series is in a supplied iterable.
df['CONDUIT_FOR_OTHER_DONATION'] = df['TRANSACTION_ID'].isin(df['BACK_REFERENCE_TRAN_ID_NUMBER'].unique())
As @root mentioned, if you prefer 0/1 (as in your example) instead of True/False, you can cast to int:
df['CONDUIT_FOR_OTHER_DONATION'] = df['TRANSACTION_ID'].isin(df['BACK_REFERENCE_TRAN_ID_NUMBER'].unique()).astype(int)
Here's a NumPy based approach using np.in1d -
vals = np.in1d(df.TRANSACTION_ID,df.BACK_REFERENCE_TRAN_ID_NUMBER).astype(int)
df['CONDUIT_FOR_OTHER_DONATION'] = vals

How to *extract* latitude and longitude greedily in Pandas?

I have a dataframe in Pandas like this:
id loc
40 100005090 -38.229889,-72.326819
188 100020985 ut: -33.442101,-70.650327
249 10002732 ut: -33.437478,-70.614637
361 100039605 ut: 10.646041,-71.619039 \N
440 100048229 4.666439,-74.071554
I need to extract the GPS points. I first check for a match of a certain regex (found here on SO, see below) to select all cells that have a "valid" lat/long value. However, I also need to extract these numbers and either put them in a series of their own (and then call split on the comma) or put them in two new pandas series. I have tried the following for the extraction part:
ids_with_latlong["loc"].str.extract("[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),\s*[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$")
but judging by the output, the regex is not doing the matching greedily, because I get something like this:
0 1 2 3 4 5 6 7 8
40 38.229889 .229889 NaN 72.326819 NaN 72 NaN 72 .326819
188 33.442101 .442101 NaN 70.650327 NaN 70 NaN 70 .650327
Obviously it's matching more than I want (I would just need cols 0, 1, and 4), but simply dropping them is too much of a hack for me to do. Notice that the extract function also got rid of the +/- signs at the beginning. If anyone has a solution, I'd really appreciate it.
@HYRY's answer looks pretty good to me. This is just an alternate approach that uses built-in pandas methods rather than a regex. I think it's a little simpler to read, though I'm not sure if it will be sufficiently general for all your cases (it works fine on this sample data, though).
df['loc'] = df['loc'].str.replace('ut: ','')
df['lat'] = df['loc'].apply( lambda x: x.split(',')[0] )
df['lon'] = df['loc'].apply( lambda x: x.split(',')[1] )
id loc lat lon
0 100005090 -38.229889,-72.326819 -38.229889 -72.326819
1 100020985 -33.442101,-70.650327 -33.442101 -70.650327
2 10002732 -33.437478,-70.614637 -33.437478 -70.614637
3 100039605 10.646041,-71.619039 10.646041 -71.619039
4 100048229 4.666439,-74.071554 4.666439 -74.071554
As a general suggestion for this type of approach, you might think about doing it in the following steps:
1) remove extraneous characters with replace (or maybe this is where the regex is best)
2) split into pieces
3) check that each piece is valid (all you need to do is check that it's a number, although you could take the extra step of checking that it falls into the valid range for a lat or lon; a sketch of such a check follows below)
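For step 3, a minimal sketch of such a check (my own addition, using the lat/lon columns created above):
# validate that each split piece is a number within the legal lat/lon ranges
def valid_latlon(lat_str, lon_str):
    try:
        lat, lon = float(lat_str), float(lon_str)
    except (TypeError, ValueError):
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180

df['valid'] = [valid_latlon(a, b) for a, b in zip(df['lat'], df['lon'])]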
You can use (?:...) to make a group non-capturing:
df["loc"].str.extract(r"((?:[\+-])?\d+\.\d+)\s*,\s*((?:[\+-])?\d+\.\d+)")

Comparison between one element and all the others of a DataFrame column

I have a list of tuples which I turned into a DataFrame with thousands of rows, like this:
frag mass prot_position
0 TFDEHNAPNSNSNK 1573.675712 2
1 EPGANAIGMVAFK 1303.659458 29
2 GTIK 417.258734 2
3 SPWPSMAR 930.438172 44
4 LPAK 427.279469 29
5 NEDSFVVWEQIINSLSALK 2191.116099 17
...
and I have the follow rule:
def are_dif(m1, m2, ppm=10):
    if abs((m1 - m2) / m1) < ppm * 0.000001:
        v = False
    else:
        v = True
    return v
So, I only want the "frag"s whose mass differs from all the other fragments' masses. How can I achieve that "selection"?
Then, I have a list named "pinfo" that contains:
d = {'id':id, 'seq':seq_code, "1HW_fit":hits_fit}
# one for each protein
# each dictionary as the position of the protein that it describes.
So, I want to add 1 to the "hits_fit" value in the dictionary corresponding to the protein.
If I'm understanding correctly (not sure if I am), you can accomplish quite a bit by just sorting. First though, let me adjust the data to have a mix of close and far values for mass:
Unnamed: 0 frag mass prot_position
0 0 TFDEHNAPNSNSNK 1573.675712 2
1 1 EPGANAIGMVAFK 1573.675700 29
2 2 GTIK 417.258734 2
3 3 SPWPSMAR 417.258700 44
4 4 LPAK 427.279469 29
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17
Then I think you can do something like the following to select the "good" ones. First, create 'pdiff' (percent difference) to see how close mass is to the nearest neighbors:
ppm = .00001
df = df.sort_values('mass')   # df.sort() in older pandas
df['pdiff'] = (df.mass-df.mass.shift()) / df.mass
Unnamed: 0 frag mass prot_position pdiff
3 3 SPWPSMAR 417.258700 44 NaN
2 2 GTIK 417.258734 2 8.148421e-08
4 4 LPAK 427.279469 29 2.345241e-02
1 1 EPGANAIGMVAFK 1573.675700 29 7.284831e-01
0 0 TFDEHNAPNSNSNK 1573.675712 2 7.625459e-09
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 2.817926e-01
The first and last data lines make this a little tricky so this next line backfills the first line and repeats the last line so that the following mask works correctly. This works for the example here, but might need to be tweaked for other cases (but only as far as the first and last lines of data are concerned).
df = df.iloc[list(range(len(df))) + [-1]].bfill()
df[ (df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm) ]
Results:
Unnamed: 0 frag mass prot_position pdiff
4 4 LPAK 427.279469 29 0.023452
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 0.281793
Sorry, I don't understand the second part of the question at all.
Edit to add: As mentioned in a comment to @AmiTavory's answer, I think possibly the sorting approach and groupby approach could be combined for a simpler answer than this. I might try at a later time, but everyone should feel free to give this a shot themselves if interested.
Here's something that's slightly different from what you asked, but it is very simple, and I think gives a similar effect.
Using numpy.round, you can create a new column
import numpy as np
df['roundedMass'] = np.round(df.mass, 6)
Following that, you can do a groupby of the frags on the rounded mass, and use nunique to count the numbers in the group. Filter for the groups of size 1.
So, the number of frags per bin is:
df.frag.groupby(np.round(df.mass, 6)).nunique()
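To then keep only the frags whose rounded mass appears exactly once, a hedged sketch building on the same groupby (not part of the original answer):
# count how many rows share each rounded mass, then keep the singletons
counts = df.groupby(np.round(df.mass, 6))['frag'].transform('size')
unique_frags = df[counts == 1]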
Another solution would be to create a duplicate of your list (if you need to preserve it for further processing later), iterate over it, and remove all elements that do not satisfy your rule (m1 & m2).
You will get a new list with all unique masses.
Just don't forget that if you do need to use the original list later you will need to use deepcopy.
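A minimal sketch of that idea, reusing are_dif from the question (the records list of (frag, mass, prot_position) tuples is an assumption about how the data is stored):
import copy

# records: list of (frag, mass, prot_position) tuples, as described in the question
work = copy.deepcopy(records)   # keep the original list intact for later use

# keep only the entries whose mass differs from every other entry's mass
unique_masses = [rec for rec in work
                 if all(are_dif(rec[1], other[1]) for other in work if other is not rec)]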

Python conditional filtering in csv file

Please help! I have tried different things/packages to write a program that takes in 4 inputs and returns the writing score statistics of a group based on that combination of inputs from a CSV file. This is my first project, so I would appreciate any insights/hints/tips!
Here is the csv sample (has 200 rows total):
id gender ses schtyp prog write
70 male low public general 52
121 female middle public vocation 68
86 male high public general 33
141 male high public vocation 63
172 male middle public academic 47
113 male middle public academic 44
50 male middle public general 59
11 male middle public academic 34
84 male middle public general 57
48 male middle public academic 57
75 male middle public vocation 60
60 male middle public academic 57
Here is what I have so far:
import csv
import numpy
csv_file_object=csv.reader(open('scores.csv', 'rU')) #reads file
header=csv_file_object.next() #skips header
data=[] #loads data into array for processing
for row in csv_file_object:
    data.append(row)
data=numpy.array(data)
#asks for inputs
gender=raw_input('Enter gender [male/female]: ')
schtyp=raw_input('Enter school type [public/private]: ')
ses=raw_input('Enter socioeconomic status [low/middle/high]: ')
prog=raw_input('Enter program status [general/vocation/academic: ')
#makes them lower case and strings
prog=str(prog.lower())
gender=str(gender.lower())
schtyp=str(schtyp.lower())
ses=str(ses.lower())
What I am missing is how to filter and get stats only for a specific group. For example, say I input male, public, middle, and academic -- I'd want to get the average writing score for that subset. I tried the groupby function from pandas, but that only gets you stats for broad groups (such as public vs private). I also tried DataFrame from pandas, but that only got me filtering on one input, and I am not sure how to get the writing scores. Any hints would be greatly appreciated!
Agreeing with Ramon, Pandas is definitely the way to go, and it has extraordinary filtering/sub-setting capability once you get used to it. But it can be tough to wrap your head around at first (or at least it was for me!), so I dug up some examples of the sub-setting you need from some of my old code. The variable itu below is a Pandas DataFrame with data on various countries over time.
# Subsetting by using True/False:
subset = itu['CntryName'] == 'Albania' # returns True/False values
itu[subset] # returns 1x144 DataFrame of only data for Albania
itu[itu['CntryName'] == 'Albania'] # one-line command, equivalent to the above two lines
# Pandas has many built-in functions like .isin() to provide params to filter on
itu[itu.cntrycode.isin(['USA','FRA'])] # returns where itu['cntrycode'] is 'USA' or 'FRA'
itu[itu.year.isin([2000,2001,2002])] # Returns all of itu for only years 2000-2002
# Advanced subsetting can include logical operations:
itu[itu.cntrycode.isin(['USA','FRA']) & itu.year.isin([2000,2001,2002])] # Both of above at same time
# Use .loc with two elements to simultaneously select by row/index & column:
itu.loc['USA','CntryName']
itu.iloc[204,0]
itu.loc[['USA','BHS'], ['CntryName', 'Year']]
itu.iloc[[204, 13], [0, 1]]
# Can do many operations at once, but this reduces "readability" of the code
itu[itu.cntrycode.isin(['USA','FRA']) &
    itu.year.isin([2000,2001,2002])].loc[:, ['cntrycode','cntryname','year','mpen','fpen']]
# Finally, if you're comfortable with using map() and list comprehensions,
# you can do some advanced subsetting that includes evaluations & functions
# to determine what elements you want to select from the whole, such as all
# countries whose name begins with "United":
criterion = itu['CntryName'].map(lambda x: x.startswith('United'))
itu[criterion]['CntryName'] # gives us UAE, UK, & US
Look at pandas. I think it will shorten your CSV parsing work and give you the subset functionality you're asking for...
import pandas as pd
data = pd.read_csv('fileName.txt', delim_whitespace=True)
#get all of the male students
data[data['gender'] == 'male']
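To tie this back to the four inputs in the question, a minimal sketch (column names follow the sample CSV; gender, schtyp, ses and prog are the variables read earlier):
# combine all four conditions and summarize the writing scores for that subset
subset = data[(data['gender'] == gender) &
              (data['schtyp'] == schtyp) &
              (data['ses'] == ses) &
              (data['prog'] == prog)]
print(subset['write'].mean())       # average writing score
print(subset['write'].describe())   # fuller statistics for the group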
