Edit distance between two pandas columns - python

I have a pandas DataFrame consisting of two columns of strings. I would like to create a third column containing the Edit Distance of the two columns.
from nltk.metrics import edit_distance
df['edit'] = edit_distance(df['column1'], df['column2'])
For some reason this seems to get stuck in some sort of infinite loop: it remains unresponsive for quite some time and then I have to terminate it manually.
Any suggestions are welcome.

nltk's edit_distance function compares a single pair of strings, not two whole columns. If you want to compute the edit distance between the corresponding strings in each row, apply it separately to each row's strings like this:
results = df.apply(lambda x: edit_distance(x["column1"], x["column2"]), axis=1)
Or like this (probably a little more efficient), to avoid including the irrelevant columns of the dataframe:
results = df.loc[:, ["column1", "column2"]].apply(lambda x: edit_distance(*x), axis=1)
To add the results to your dataframe, you'd use it like this:
df["distance"] = df.loc[:, ["column1","column2"]].apply(lambda x: edit_distance(*x), axis=1)
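As a minimal sketch with made-up data (the example strings are just illustrative), the row-wise apply produces one distance per row:
import pandas as pd
from nltk.metrics import edit_distance

# hypothetical example frame with two string columns
df = pd.DataFrame({"column1": ["kitten", "flaw"],
                   "column2": ["sitting", "lawn"]})

# edit_distance runs once per row, comparing that row's two strings
df["distance"] = df.loc[:, ["column1", "column2"]].apply(lambda x: edit_distance(*x), axis=1)
print(df)
# prints something like:
#   column1  column2  distance
# 0  kitten  sitting         3
# 1    flaw     lawn         2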

Python Pandas - filter pandas dataframe to get rows with minimum values in one column for each unique value in another column

Here is a dummy example of the DF I'm working with ('ETC' represents several columns):
df = pd.DataFrame(data={'PlotCode': ['A','A','A','A','B','B','B','C','C'],
                        'INVYR': [2000,2000,2000,2005,1990,2000,1990,2005,2001],
                        'ETC': ['a','b','c','d','e','f','g','h','i']})
And here is what I want to end up with:
df1 = pd.DataFrame(data={'PlotCode': ['A','A','A','B','B','C'],
                         'INVYR': [2000,2000,2000,1990,1990,2001],
                         'ETC': ['a','b','c','e','g','i']})
NOTE: I want ALL rows with the minimum 'INVYR' value for each 'PlotCode', not just one; otherwise I assume I could do something easier with drop_duplicates and sort.
So far, following the answer here (Appending pandas dataframes generated in a for loop), I've tried the following code:
df1 = []
for i in df['PlotCode'].unique():
    j = df[df['PlotCode']==i]
    k = j[j['INVYR']==j['INVYR'].min()]
    df1.append(k)
df1 = pd.concat(df1)
This code works but is very slow; my actual data contains some 40,000 different PlotCodes, so this isn't a feasible solution. Does anyone know a smooth filtering way of doing this? I feel like I'm missing something very simple.
Thank you in advance!
Try not to use for loops when using pandas; they are extremely slow in comparison to pandas' vectorized operations.
Solution 1:
Determine the minimum INVYR for every PlotCode, using .groupby():
min_invyr_per_plotcode = df.groupby('PlotCode', as_index=False)['INVYR'].min()
And use pd.merge() to do an inner join between your original df and the minimum you just found:
result_df = pd.merge(
    df,
    min_invyr_per_plotcode,
    how='inner',
    on=['PlotCode', 'INVYR'],
)
Solution 2:
Again, determine the minimum per group, but now add it as a column to your dataframe. This minimum per group gets added to every row by using .groupby().transform()
df['min_per_group'] = (df
    .groupby('PlotCode')['INVYR']
    .transform('min')
)
Now filter your dataframe where INVYR in a row is equal to the minimum of that group:
df[df['INVYR'] == df['min_per_group']]
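Put together as a single filter on the dummy data above (a sketch of Solution 2 without the helper column), this keeps every row that ties for its group's minimum:
import pandas as pd

df = pd.DataFrame(data={'PlotCode': ['A','A','A','A','B','B','B','C','C'],
                        'INVYR': [2000,2000,2000,2005,1990,2000,1990,2005,2001],
                        'ETC': ['a','b','c','d','e','f','g','h','i']})

# keep all rows whose INVYR equals the minimum INVYR of their PlotCode group
df1 = df[df['INVYR'] == df.groupby('PlotCode')['INVYR'].transform('min')]
print(df1)  # rows a, b, c, e, g, i — matching the expected output in the question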

How to get the difference of 2 lists in a Pandas DataFrame?

I'm new to Python and pandas. I'm having trouble finding the difference between 2 lists within a pandas DataFrame.
Example Input with ; separator:
ColA; ColB
A,B,C,D; B,C,D
A,C,E,F; A,C,F
Expected Output:
ColA; ColB; ColC
A,B,C,D; B,C,D; A
A,C,E,F; A,C,F; E
What I want to do is similar to:
df['ColC'] = np.setdiff1d(df['ColA'].str.split(','), df['ColB'].str.split(','))
But it returns an error:
raise ValueError('Length of values does not match length of index',data,index,len(data),len(index))
Kindly advise
You can apply a lambda function on the DataFrame to find the difference like this:
import pandas as pd
# creating DataFrame (can also be loaded from a file)
df = pd.DataFrame([[['A','B','C','D'], ['B','C']]], columns=['ColA','ColB'])
# apply a lambda function to get the difference
df['ColC'] = df[['ColA','ColB']].apply(lambda x: [i for i in x[0] if i not in x[1]], axis=1)
Please note: this will find the asymmetric difference ColA - ColB.
Result: for the example frame above, ColC comes out as ['A', 'D'], the elements of ColA that are missing from ColB.
A much faster way to do this would be a simple set subtraction:
import pandas as pd
#Creating a dataframe
df = pd.DataFrame([[['A','B','C','D'], ['B','C']]], columns=['ColA','ColB'])
#Finding the difference
df['ColC']= df['ColA'].map(set)-df['ColB'].map(set)
As the dataframe grows in number of rows, any row-by-row operation becomes computationally expensive.
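As a sketch tying this back to the comma-separated strings in the question (joining back into a string is an assumption about the desired output format), the set subtraction can start from the raw columns:
import pandas as pd

# the example input from the question, as plain comma-separated strings
df = pd.DataFrame({'ColA': ['A,B,C,D', 'A,C,E,F'],
                   'ColB': ['B,C,D', 'A,C,F']})

# split into sets, subtract element-wise, then join back into a string
diff = df['ColA'].str.split(',').map(set) - df['ColB'].str.split(',').map(set)
df['ColC'] = diff.map(lambda s: ','.join(sorted(s)))
print(df)
# prints something like:
#       ColA   ColB ColC
# 0  A,B,C,D  B,C,D    A
# 1  A,C,E,F  A,C,F    E
Note that the plain set subtraction leaves sets in ColC; sorting and joining is only needed if you want strings like in the expected output.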

Using pd.DataFrame.replace with an apply function as the replacement value

I have several dataframes in which some columns contain dates in this ASP.NET format: "/Date(1239018869048)/". I've figured out how to parse this into Python's datetime format for a given column. However, I would like to put this logic into a function so that I can pass it any dataframe and have it replace all the dates that match a regex, using pd.DataFrame.replace.
something like:
def pretty_dates(df):
    # messy logic here
    ...

df.replace(to_replace=r'\/Date\((\d+)\)\/', value=pretty_dates(df), regex=True)
The problem with this is that the df being passed to pretty_dates is the whole dataframe, not just the cell that needs to be replaced.
So what I'm trying to figure out is whether the replacement value in df.replace can be a function instead of a static value.
Thank you so much in advance
EDIT
To try to add some clarity: I have many columns in a dataframe, over a hundred, that contain this date format. I would like to avoid listing out every single column that has a date. Is there a way to apply the function to clean my dates across all the columns in my dataset? So I do not want to clean 1 column but all the hundreds of columns of my dataframe.
I'm sure you can use regex to do this in one step, but here is how to apply it to the whole column at once:
df = pd.Series(['/Date(1239018869048)/',
                '/Date(1239018869048)/'], dtype=str)
df = df.str.replace(r'\/Date\(', '', regex=True)
df = df.str.replace(r'\)\/', '', regex=True)
print(df)
0 1239018869048
1 1239018869048
dtype: object
As far as I understand, you need to apply a custom function to selected cells in a specified column. Hopefully the following example helps:
import pandas as pd
df = pd.DataFrame({'x': ['one', 'two', 'three']})
selection = df.x.str.contains('t', regex=True) # put your regexp here
df.loc[selection, 'x'] = df.loc[selection, 'x'].map(lambda x: x+x) # do some logic instead
You can apply this procedure to all columns of the df in a loop:
for col in df.columns:
    selection = df.loc[:, col].str.contains('t', regex=True)  # put your regexp here
    df.loc[selection, col] = df.loc[selection, col].map(lambda x: x+x)  # do some logic instead
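If the end goal is real datetime values rather than bare millisecond strings, one hedged way to combine the two ideas above is to loop over the columns, extract the digits, and hand them to pd.to_datetime with unit='ms' (the column-detection rule below is an assumption, not from the original post):
import pandas as pd

# hypothetical frame: one of many columns holds ASP.NET-style dates
df = pd.DataFrame({'created': ['/Date(1239018869048)/', '/Date(1239018869048)/'],
                   'name': ['a', 'b']})

def pretty_dates(df):
    for col in df.columns:
        # only touch string columns where every value matches /Date(...)/ (an assumed rule)
        if df[col].dtype == object and df[col].str.match(r'/Date\(\d+\)/', na=False).all():
            ms = df[col].str.extract(r'/Date\((\d+)\)/')[0].astype('int64')
            df[col] = pd.to_datetime(ms, unit='ms')
    return df

df = pretty_dates(df)
print(df.dtypes)  # 'created' becomes datetime64[ns], 'name' stays object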

How do you effectively use pd.DataFrame.apply on rows with duplicate values?

The function that I'm applying is a little expensive, so I want it to calculate the value only once for each unique value.
The only solution I've been able to come up with has been as follows:
This step is needed because apply doesn't work on arrays, so I have to convert the unique values into a Series.
new_vals = pd.Series(data['column'].unique()).apply(function)
This one is needed because .merge has to be used on DataFrames.
new_dataframe = pd.DataFrame( index = data['column'].unique(), data = new_vals.values)
Finally, merging the results:
yet_another = pd.merge(data, new_dataframe, right_index=True, left_on='column')
data['calculated_column'] = yet_another[0]
So basically I had to convert my values to a Series, apply the function, convert to a DataFrame, merge the results, and use that column to create my new column.
I'm wondering if there is some one-line solution that isn't as messy. Something pythonic that doesn't involve re-casting object types multiple times. I've tried grouping by but I just can't figure out how to do it.
My best guess would have been to do something along these lines
data[calculated_column] = dataframe.groupby(column).index.apply(function)
but that isn't right either.
This is an operation that I do often enough to want to learn a better way to do, but not often enough that I can easily find the last time I used it, so I end up re-figuring a bunch of things again and again.
If there is no good solution, I guess I could just add this function to my library of common tools that I hedonistically pull in with from me_tools import *:
def apply_unique(data, column, function):
    new_vals = pd.Series(data[column].unique()).apply(function)
    new_dataframe = pd.DataFrame(data=new_vals.values, index=data[column].unique())
    result = pd.merge(data, new_dataframe, right_index=True, left_on=column)
    return result[0]
I would do something like this:
def apply_unique(df, orig_col, new_col, func):
    return df.merge(df[[orig_col]]
                      .drop_duplicates()
                      .assign(**{new_col: lambda x: x[orig_col].apply(func)}),
                    how='inner', on=orig_col)
This will return the same DataFrame as performing:
df[new_col] = df[orig_col].apply(func)
but will be much more performant when there are many duplicates.
How it works:
We join the original DataFrame (the calling one) to another DataFrame (the passed one) that contains two columns: the original column and the new column transformed from the original column.
The new column in the passed DataFrame is assigned using .assign and a lambda function, making it possible to apply the function to the DataFrame that has already had .drop_duplicates() performed on it.
A dict is used here for convenience only, as it allows a column name to be passed in as a str.
Edit:
As an aside: best to drop new_col if it already exists, otherwise the merge will append suffixes to each new_col
if new_col in df:
    df = df.drop(new_col, axis='columns')
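A quick usage sketch with a hypothetical slow function and toy data, folding the drop from the edit into the helper:
import pandas as pd

def slow_func(x):
    # stand-in for an expensive computation
    return x ** 2

def apply_unique(df, orig_col, new_col, func):
    if new_col in df:
        df = df.drop(new_col, axis='columns')
    return df.merge(df[[orig_col]]
                      .drop_duplicates()
                      .assign(**{new_col: lambda x: x[orig_col].apply(func)}),
                    how='inner', on=orig_col)

data = pd.DataFrame({'column': [1, 2, 2, 3, 3, 3]})
data = apply_unique(data, 'column', 'squared', slow_func)
print(data)  # slow_func ran only 3 times (once per unique value) instead of 6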

Change CSV numerical values based on header name in file using Python

I have a .csv file filled with observations from sensors in the field. The sensors write data as millimeters and I need it as meters to import into another application. My idea was to use Python and possibly pandas to:
1. Read in the .csv as dataframe
2. Find the headers of the data I need to modify (divide each actual value by 1000)
3. Divide each value in the chosen column by 1000 to convert it to meters
4. Write the resulting updated file to disk
Objective: I need to modify all the values except those with a header that contains "rad" in it.
This is what the data looks like:
Here is what I have done so far:
Read data into a dataframe:
import pandas as pd
import numpy as np
delta_df = pd.read_csv('SAAF_121581_67_500.dat',index_col=False)
Filter out all the data that I don't want to touch:
delta_df.filter(like='rad', axis=1)
Here is where I got stuck, as I couldn't figure out how to filter the dataframe to the opposite (not like='rad').
How can I do this?
It's easier if you post the dataframe rather than the image, as the image is not reproducible.
You can use DataFrame.select with a regex to keep all the columns containing 'rad' (this needs import re; note that DataFrame.select has since been deprecated and removed in newer pandas versions):
delta_df = delta_df.select(lambda x: re.search('rad', x), axis=1)
In case you are trying to remove all the columns containing 'rad', use:
delta_df = delta_df.select(lambda x: not re.search('rad', x), axis=1)
Alternate solution without regex (this one still works in current pandas):
delta_df.filter(like='rad', axis=1)
EDIT:
Given the dataframes containing 'rad' and not containing 'rad', built like this:
df_norad = df.select(lambda x: not re.search('rad', x), axis=1)
df_rad = df.select(lambda x: re.search('rad', x), axis=1)
You can convert the values of df_norad to meters and then merge it back with df_rad:
merged = pd.concat([df_norad, df_rad], axis = 1)
You can write the merged dataframe to CSV using to_csv:
merged.to_csv('yourfilename.csv')
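For reference, a hedged sketch of the whole workflow with current pandas (the output filename is made up, and it assumes every non-'rad' column is numeric):
import pandas as pd

delta_df = pd.read_csv('SAAF_121581_67_500.dat', index_col=False)

# boolean mask over column names: True for columns that do NOT contain 'rad'
to_convert = ~delta_df.columns.str.contains('rad')

# divide only those columns by 1000 to convert millimeters to meters
delta_df.loc[:, to_convert] = delta_df.loc[:, to_convert] / 1000

delta_df.to_csv('SAAF_121581_67_500_meters.csv', index=False)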
Off the top of my head I believe you can do something like this:
delta_df.filter(regex='^(?!.*rad)', axis=1)
Where we use the regex parameter instead of the like parameter (note that regex and like are mutually exclusive).
The negative lookahead (?!.*rad) makes the regex select every column whose name does not contain 'rad', which are the columns you want to convert.
Again, I don't have an environment set up to test this but I hope this motivates the idea well enough.
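Since that answer is untested, here is a quick check with a tiny invented frame (the column names are made up):
import pandas as pd

df = pd.DataFrame({'x_rad': [0.1], 'y_rad': [0.2], 'depth_mm': [3000]})

print(df.filter(regex='^(?!.*rad)', axis=1))  # keeps only depth_mm
print(df.filter(like='rad', axis=1))          # keeps only the *_rad columns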
