How to optimize Levenshtein edit distance on a pandas dataframe using Python?

I am running a Levenshtein comparison on 50k records. I need to compare every record against every other record. Is there a way to optimize the following code so it runs faster? The data is stored in a pandas dataframe.
import pandas as pd
import numpy as np
import Levenshtein

df_s_sorted = df.sort_values(['nonascii_2', 'birth_date'])
df_similarity = pd.DataFrame()
q = 0
for index, p in df_s_sorted.iterrows():
    q = q + 1
    print(q)
    for index1, p1 in df_s_sorted.iterrows():
        if (p["birth_date"] == p1["birth_date"]) and (p["name"] != p1["name"]):
            if Levenshtein.distance(p["name"], p1["name"]) == 1:
                df_similarity = df_similarity.append(p)
                print(p)
    df_s_sorted.drop(index, inplace=True)
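One way to speed this up (a sketch, not tested on your data) is to avoid the full O(n²) scan: only rows sharing a birth date can ever match, so group by birth_date first and run the pairwise Levenshtein check inside each, much smaller, group. The column names ('name', 'birth_date') are taken from the question.

import pandas as pd
import Levenshtein
from itertools import combinations

# Sketch: only rows with the same birth_date can match, so group first.
similar_rows = []
for birth_date, group in df.groupby('birth_date'):
    pairs = zip(group.index.tolist(), group['name'].tolist())
    for (i, a), (j, b) in combinations(pairs, 2):
        if a != b and Levenshtein.distance(a, b) == 1:
            similar_rows.append(df.loc[i])

df_similarity = pd.DataFrame(similar_rows)

This keeps the quadratic comparison confined to each birth-date group, which is usually far smaller than 50k rows; for very large groups, a dedicated string-matching library such as rapidfuzz may help further, but that goes beyond the original code.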

Related

iterating over each row in pandas to evaluate condition

I have the following code
import pandas as pd
from pandas_datareader import data as web
import numpy as np
import math
data = web.DataReader('goog', 'yahoo')
data['lifetime'] = data['High'].asfreq('D').rolling(window=999999, min_periods=1).max()  # to check if it is a lifetime high
How can I compare the two columns so that I get a boolean (preferably as 1 and 0) telling me, for each row, whether data['High'] is close to data['lifetime']? Something like:
data['isclose'] = math.isclose(data['High'], data['lifetime'], rel_tol=0.003)
Any help would be appreciated.
You can use np.where() together with np.isclose(), the vectorized counterpart of math.isclose(), which only accepts scalars (note that np.isclose takes rtol/atol and its tolerance formula differs slightly):
import numpy as np

# np.isclose works element-wise on Series; rtol plays the role of rel_tol
data['isclose'] = np.where(np.isclose(data['High'], data['lifetime'], rtol=0.003), 1, 0)
You could also use pandas' apply() function:
import math
from pandas_datareader import data as web

data = web.DataReader("goog", "yahoo")
data["lifetime"] = data["High"].asfreq("D").rolling(window=999999, min_periods=1).max()
data["isclose"] = data.apply(
    lambda row: 1 if math.isclose(row["High"], row["lifetime"], rel_tol=0.003) else 0,
    axis=1,
)
print(data)
However, yudhiesh's solution using np.where() is faster.
See also: Why is np.where faster than pd.apply
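If you want to measure the difference yourself, here is a small sketch using the standard timeit module (it assumes data, np, and math from the snippets above are in scope; the timings depend entirely on the data size):

import timeit

# Illustrative timing comparison of the two approaches shown above.
t_where = timeit.timeit(
    lambda: np.where(np.isclose(data["High"], data["lifetime"], rtol=0.003), 1, 0),
    number=100)
t_apply = timeit.timeit(
    lambda: data.apply(lambda row: 1 if math.isclose(row["High"], row["lifetime"],
                                                     rel_tol=0.003) else 0, axis=1),
    number=100)
print(f"np.where: {t_where:.3f}s   apply: {t_apply:.3f}s")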

How to improve the performance of average calculations in a pandas dataframe

I am trying to improve the performance of a current piece of code, whereby I loop through a dataframe (dataframe 'r') and find the average values from another dataframe (dataframe 'p') based on criteria.
I want to find the average of all values (column 'Val') from dataframe 'p' where (r.RefDate = p.RefDate) & (r.Item = p.Item) & (p.StartDate >= r.StartDate) & (p.EndDate <= r.EndDate)
Dummy data for this can be generated as per the below:
import pandas as pd
import numpy as np
from datetime import datetime
######### START CREATION OF DUMMY DATA ##########
rng = pd.date_range('2019-01-01', '2019-10-28')
daily_range = pd.date_range('2019-01-01', '2019-12-31')

p = pd.DataFrame(columns=['RefDate', 'Item', 'StartDate', 'EndDate', 'Val'])
for item in ['A', 'B', 'C', 'D']:
    for date in daily_range:
        daily_p = pd.DataFrame({'RefDate': rng,
                                'Item': item,
                                'StartDate': date,
                                'EndDate': date,
                                'Val': np.random.randint(0, 100, len(rng))})
        p = p.append(daily_p)

r = pd.DataFrame(columns=['RefDate', 'Item', 'PeriodStartDate', 'PeriodEndDate', 'AvgVal'])
for item in ['A', 'B', 'C', 'D']:
    r1 = pd.DataFrame({'RefDate': rng,
                       'Item': item,
                       'PeriodStartDate': '2019-10-25',
                       'PeriodEndDate': '2019-10-31',  # datetime(2019,10,31)
                       'AvgVal': 0})
    r = r.append(r1)
r.reset_index(drop=True, inplace=True)
######### END CREATION OF DUMMY DATA ##########
The piece of code I currently have doing the calculation, and whose performance I would like to improve, is as follows:
for i in r.index:
    avg_price = p['Val'].loc[(p['StartDate'] >= r.loc[i]['PeriodStartDate']) &
                             (p['EndDate'] <= r.loc[i]['PeriodEndDate']) &
                             (p['RefDate'] == r.loc[i]['RefDate']) &
                             (p['Item'] == r.loc[i]['Item'])].mean()
    r['AvgVal'].loc[i] = avg_price
The first change is that when generating the r DataFrame, both PeriodStartDate and PeriodEndDate should be created as datetime; see the following fragment of your initialization code, changed by me:
r1 = pd.DataFrame({'RefDate': rng, 'Item': item,
                   'PeriodStartDate': pd.to_datetime('2019-10-25'),
                   'PeriodEndDate': pd.to_datetime('2019-10-31'), 'AvgVal': 0})
To get better speed, I set the index in both DataFrames to RefDate and Item
(the two columns compared on equality) and sorted by the index:
p.set_index(['RefDate', 'Item'], inplace=True)
p.sort_index(inplace=True)
r.set_index(['RefDate', 'Item'], inplace=True)
r.sort_index(inplace=True)
This way, the access by index is significantly quicker.
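For example (a small sketch using one RefDate/Item pair from the dummy data above), a single label lookup now pulls out all matching rows of p without a boolean scan of the whole frame:

# After set_index(['RefDate', 'Item']) + sort_index(), one label lookup
# returns every p-row for that (RefDate, Item) pair.
key = (pd.Timestamp('2019-10-26'), 'A')   # example key from the dummy data
pp = p.loc[key]                           # all rows of p with this RefDate and Item
print(len(pp))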
Then I defined the following function computing the mean for rows
from p "related to" the current row from r:
def myMean(row):
    pp = p.loc[row.name]
    return pp[pp.StartDate.ge(row.PeriodStartDate) &
              pp.EndDate.le(row.PeriodEndDate)].Val.mean()
And the only thing to do is to apply this function (to each row in r) and
save the result in AvgVal:
r.AvgVal = r.apply(myMean, axis=1)
Using %timeit, I compared the execution time of the code proposed by EdH with mine;
my version ran almost 10 times faster.
Check it on your own.
By using iterrows I managed to improve the performance, although there may still be quicker ways.
for index, row in r.iterrows():
    avg_price = p['Val'].loc[(p['StartDate'] >= row.PeriodStartDate) &
                             (p['EndDate'] <= row.PeriodEndDate) &
                             (p['RefDate'] == row.RefDate) &
                             (p['Item'] == row.Item)].mean()
    r.loc[index, 'AvgVal'] = avg_price

How to get the pandas series diff in a for loop?

I have a timeseries called ts, with some values as shown below:
import numpy as np
import pandas as pd
ts = pd.Series(range(10))
ts.index = pd.date_range('2019-01-01',periods=len(ts))
ts
I can get multiple differencing like this:
ts.diff().dropna()
ts.diff().diff().dropna()
How can I do this using a for loop?
for d in range(7):
    tsx = ?  # I don't know what to do here
We have pd.eval:
for d in range(7):
    tsx = pd.eval('ts' + '.diff()' * d + '.dropna()')
You can simply use the built-in eval function:
for d in range(7):
    tsx = eval('ts' + '.diff()' * d + '.dropna()')
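Building the expression as a string is not strictly necessary; a sketch of the same loop without eval, simply re-applying .diff() the required number of times, would be:

# Same result without eval: diff the series d times, then drop the NaNs.
for d in range(7):
    tsx = ts.copy()
    for _ in range(d):
        tsx = tsx.diff()
    tsx = tsx.dropna()
    # ... use tsx here, e.g. print(tsx.tail())

This behaves the same for each d (d == 0 gives the original series) and avoids string evaluation entirely.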

convert pandas df to multi-dimensional numpy array

I have very sparse data in a pandas dataframe with 25 million+ records. It has to be converted into a multi-dimensional numpy array. I have written this in the straightforward way using a for loop, and was wondering if there is a more efficient way.
import numpy as np
import pandas as pd

facts_pd = pd.DataFrame.from_records(
    columns=['name', 'offset', 'code'],
    data=[('John', -928, 'dx_434'), ('Steve', -757, 'dx_5859'), ('Jack', -800, 'dx_250'),
          ('John', -919, 'dx_401'), ('John', -956, 'dx_5859')])

name_lu = pd.DataFrame(sorted(facts_pd['name'].unique()), columns=['name'])
name_lu["nameid"] = name_lu.index
offset_lu = pd.DataFrame(sorted(facts_pd['offset'].unique(), reverse=True), columns=['offset'])
offset_lu["offsetid"] = offset_lu.index
code_lu = pd.DataFrame(sorted(facts_pd['code'].unique()), columns=['code'])
code_lu["codeid"] = code_lu.index

facts_pd = pd.merge(pd.merge(pd.merge(facts_pd, name_lu, how="left", on="name"),
                             offset_lu, how="left", on="offset"),
                    code_lu, how="left", on="code")
facts_pd.drop(["name", "offset", "code"], inplace=True, axis=1)

facts_np = np.zeros((len(name_lu), len(offset_lu), len(code_lu)))
for row in facts_pd.iterrows():
    i, j, k = row[1]
    facts_np[i][j][k] = 1
The command you are probably looking for is dataframe.as_matrix(). Despite the name, it returns a numpy array, not a matrix; here are the docs for it.
Here is another Stack Overflow topic on its use as well.
Refurbished code:
import numpy as np
import pandas as pd

facts_pd = pd.DataFrame.from_records(
    columns=['name', 'offset', 'code'],
    data=[('John', -928, 'dx_434'), ('Steve', -757, 'dx_5859'), ('Jack', -800, 'dx_250'),
          ('John', -919, 'dx_401'), ('John', -956, 'dx_5859')])

facts_np = facts_pd.as_matrix()
print(facts_np)  # displays the dataframe contents as a numpy array
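Note that DataFrame.as_matrix() was deprecated and later removed in newer pandas versions; DataFrame.to_numpy() (or .values) is the current equivalent. Also, this only gives a 2-D array of the id columns, not the 3-D indicator array the question builds. A vectorized sketch of that last step, using the nameid/offsetid/codeid columns already computed above, could look like this:

# Vectorized fill of the 3-D indicator array, replacing the iterrows() loop.
# Assumes facts_pd already holds the integer columns nameid, offsetid, codeid.
facts_np = np.zeros((len(name_lu), len(offset_lu), len(code_lu)))
facts_np[facts_pd["nameid"].to_numpy(int),
         facts_pd["offsetid"].to_numpy(int),
         facts_pd["codeid"].to_numpy(int)] = 1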

how to make python loop faster to run pairwise association test

I have a list of patient ids and drug names and a list of patient ids and disease names. I want to find the most indicative drug for each disease.
To find this I want to do a Fisher exact test to get the p-value for each disease/drug pair. But the loop runs very slowly, taking more than 10 hours. Is there a way to make the loop more efficient, or a better way to solve this association problem?
My loop:
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

most_indicative_medication = {}
rx_list = list(meps_meds.rxName.unique())
disease_list = list(meps_base_data.columns.values)[8:]

for i in disease_list:
    print(i)
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        subset = subset[[i, j]]
        tab = pd.crosstab(subset[i], subset[j])
        if len(tab.columns) == 2:
            rx_dict[j] = fisher_exact(tab)[1]
        else:
            rx_dict[j] = np.nan
    most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)
You need multiprocessing/multithreading. I have added the code:
from multiprocessing.dummy import Pool as ThreadPool

most_indicative_medication = {}
rx_list = list(meps_meds.rxName.unique())
disease_list = list(meps_base_data.columns.values)[8:]

def run_pairwise(i):
    print(i)
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        subset = subset[[i, j]]
        tab = pd.crosstab(subset[i], subset[j])
        if len(tab.columns) == 2:
            rx_dict[j] = fisher_exact(tab)[1]
        else:
            rx_dict[j] = np.nan
    most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)

pool = ThreadPool(3)
pairwise_test_results = pool.map(run_pairwise, disease_list)
pool.close()
pool.join()
Notes: http://chriskiehl.com/article/parallelism-in-one-line/
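One caveat, not from the original answer: multiprocessing.dummy uses threads, so CPU-bound pandas/scipy work that holds the GIL may not speed up much. Swapping in a real process pool is a small change, sketched below, at the cost of the worker and its data having to be importable and picklable:

from multiprocessing import Pool  # real processes instead of threads

# Note: unlike threads, worker processes do not share memory, so run_pairwise
# would need to return its result instead of writing into a shared dict.
if __name__ == '__main__':
    with Pool(3) as pool:
        pairwise_test_results = pool.map(run_pairwise, disease_list)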
Crunching faster is good, but a better algorithm usually beats it ;-)
Filling in a bit,
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact
# data files can be download at
# https://github.com/Saynah/platform/tree/d7e9f150ef2ff436387585960ca312a301847a46/data
meps_meds = pd.read_csv("meps_meds.csv") # 8 cols * 1,148,347 rows
meps_base_data = pd.read_csv("meps_base_data.csv") # 18 columns * 61,489 rows
# merge to get disease and drug info in same table
merged = pd.merge(  # 25 cols * 1,148,347 rows
    meps_base_data, meps_meds,
    how='inner', left_on='id', right_on='id'
)
rx_list = meps_meds.rxName.unique().tolist() # 9218 items
disease_list = meps_base_data.columns.values[8:].tolist() # 10 items
Note that rx_list has a LOT of duplicates (e.g. 52 entries for Amoxicillin, if you include misspellings).
Then
most_indicative = {}

for disease in disease_list:
    # get unique (id, disease, prescription) rows
    subset = merged[['id', disease, 'rxName']].drop_duplicates()

    # keep only Yes/No entries under the disease column
    subset = subset[subset[disease].isin(['Yes', 'No'])]

    # summarize (replaces the inner loop)
    tab = pd.crosstab(subset.rxName, subset[disease])

    # total No/Yes counts, bound as defaults for the Fisher exact function
    nf, yf = tab.sum().values

    def p_value(x, nf=nf, yf=yf):
        return fisher_exact([[nf - x.No, x.No], [yf - x.Yes, x.Yes]])[1]

    # OPTIONAL:
    # We can probably assume that the most-indicative drugs are among
    # the most-prescribed; keep just the 100 most-prescribed drugs.
    # Note: you have to get the nf, yf values before doing this!
    tab = tab.sort_values("Yes", ascending=False)[:100]

    # and apply the function
    tab["P-value"] = tab.apply(p_value, axis=1)

    # find the best match
    best_med = tab.sort_values("P-value").index[0]
    most_indicative[disease] = best_med
This now runs in about 18 minutes on my machine, and you could probably combine it with EM28's answer to speed it up by another factor of 4 or more.
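A rough sketch of that combination, wrapping the loop body above in a worker function (the name best_drug_for is invented here for illustration) and mapping it over the diseases with the thread pool from EM28's answer:

from multiprocessing.dummy import Pool as ThreadPool

def best_drug_for(disease):
    # loop body from above, returning (disease, best_med)
    subset = merged[['id', disease, 'rxName']].drop_duplicates()
    subset = subset[subset[disease].isin(['Yes', 'No'])]
    tab = pd.crosstab(subset.rxName, subset[disease])
    nf, yf = tab.sum().values
    tab = tab.sort_values("Yes", ascending=False)[:100]
    tab["P-value"] = tab.apply(
        lambda x: fisher_exact([[nf - x.No, x.No], [yf - x.Yes, x.Yes]])[1], axis=1)
    return disease, tab.sort_values("P-value").index[0]

pool = ThreadPool(4)
most_indicative = dict(pool.map(best_drug_for, disease_list))
pool.close()
pool.join()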
