fuzzywuzzy to normalize strings in a pandas column - python

I have a dataframe like this
Now I want to normalize the strings in the 'comments' column so that misspelled variants of the word 'election' are replaced with 'election'. I tried using fuzzywuzzy but wasn't able to apply it to the pandas dataframe to partially match the word 'election'. The output dataframe should have the word 'election' in the 'comments' column like this
Assume that I have around 100k rows, and the possible misspellings of the word 'election' can be many.
Kindly guide me on this part.

Building on the answer you gave, you can use the pandas apply, stack and groupby functions to accelerate your code. Say you have input such as:
import pandas as pd
from fuzzywuzzy import fuzz

df = pd.DataFrame({'Merchant details': ['Alpha co', 'Bravo co'],
                   'Comments': ['electionsss are around',
                                'vote in eelecttions']})
For the 'Comments' column, you can create a temporary MultiIndex DataFrame containing one word per row by splitting on spaces and using the stack function:
df_temp = pd.DataFrame(
    {'split_comments': df['Comments'].str.split(' ', expand=True).stack()})
Then you create a column with the corrected words (following your idea), using apply and a fuzz.ratio comparison:
df_temp['corrected_comments'] = df_temp['split_comments'].apply(
    lambda wd: 'election' if fuzz.ratio(wd, 'election') > 75 else wd)
Finally, you write the corrected data back into the Comments column of df using groupby and join:
df['Comments'] = df_temp.reset_index().groupby('level_0').apply(
    lambda wd: ' '.join(wd['corrected_comments']))
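For reference (assuming the usual SequenceMatcher-style scorer), fuzz.ratio scores both 'electionsss' and 'eelecttions' at 84 against 'election', so with the 75 threshold df['Comments'] should come back as 'election are around' and 'vote in election'.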

Don't operate on the dataframe. The overhead will kill you. Turn the column into a list, then iterate over that. Finally, assign that list back to the column, as in the sketch below.
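A minimal sketch of that list-based approach, assuming the same single target word 'election' and the 75 ratio threshold used above:

from fuzzywuzzy import fuzz

comments = df['Comments'].tolist()  # pull the column out of pandas once
for i, text in enumerate(comments):
    comments[i] = ' '.join(
        'election' if fuzz.ratio(w, 'election') > 75 else w
        for w in text.split())
df['Comments'] = comments  # assign the corrected list back in one step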

Ok, I tried this myself and came up with this code -
word = ['election']  # list of target words to normalize to
for i in range(len(df)):
    a = df.comments[i].split()
    for j in word:
        for k in range(len(a)):
            if fuzz.ratio(j, a[k]) > 75:
                a[k] = j
    df.loc[i, 'comments'] = ' '.join(a)
But this approach seems slow for a large dataframe.
Can someone provide a more pythonic way of implementing this?

TypeError: sequence item 0: expected str instance, tuple found

I have a column of tuples from which I would like to remove the brackets.
Example
words
(hello,me)
(what,can)
(ring, dog)
I have tried this:
df['words'].agg(','.join)
Unfortunately I receive the error in the title.
I would like this output:
words
hello,me
what,can
ring, dog
Any solution?
Also, strangely enough, with a different dataset that line of code works. Any ideas why?
I think you can use df.apply to update the words column, applying a function that modifies the value of each row:
import pandas as pd

df = pd.DataFrame({'words': [('hello', 'me'), ('what', 'can')]})
df['words'] = df.apply(lambda row: ','.join(row['words']), axis=1)
Edit: come to think of it, your original approach using df['words'].agg should also work, but you need to assign the result back to the words column for it to change the dataframe:
import pandas as pd

df = pd.DataFrame({'words': [('hello', 'me'), ('what', 'can')]})
df['words'] = df['words'].agg(','.join)
print(df)
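Either way, the printed result should be:

      words
0  hello,me
1  what,can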

How to extract the contents of a column into several columns

I have an Excel file that I import into a dataframe. I want to extract the contents of one column into several columns.
Here is the original
After importing it into pandas in Python, I get this data with '\n'
So, I want to extract the contents of that column. Could you share an idea or code?
My expected columns are....
Don't worry, no one is born knowing everything about SO. Given the data you showed, especially that 'Vector:...' is not separated by '\n', the following works:
import pandas as pd
import numpy as np

data = pd.read_excel("the_data.xlsx")
ok = []
l = len(data['Details'])
for n in range(l):
    x = data['Details'][n].split()
    # note: str.lstrip strips a set of characters, not a prefix,
    # so remove the literal 'Vector:' prefix with replace instead
    x[2] = x[2].replace('Vector:', '', 1)
    x = [v for v in x if v not in ['Type:', 'Mission:']]
    ok += x
values = np.array(ok).reshape(l, 3)
df = pd.DataFrame(values, columns=['Type', 'Vector', 'Mission'])
data.drop('Details', axis=1, inplace=True)
final = pd.concat([data, df], axis=1)
The process goes like this:
First you split every element of the Details column into a list of strings. Second, you handle the 'Vector:....' special case and filter out the label tokens. Third, you store all the values in a list, which is in turn converted to a numpy array of shape (length, 3). Finally, you drop the old 'Details' column and concatenate the original data with the df created from the split strings.
You may want to try a more efficient way to transform your data at read time, by applying these ideas inside the pd.read_excel call using converters, as sketched below.
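A hedged sketch of that converters idea (parse_details is an illustrative helper; the 'Details' column name and label tokens are taken from the answer above, so adjust them to your file):

import pandas as pd

def parse_details(cell):
    # turn one 'Details' cell into a (Type, Vector, Mission) tuple
    tokens = [t for t in str(cell).split() if t not in ('Type:', 'Mission:')]
    return tuple(t.replace('Vector:', '', 1) for t in tokens)

data = pd.read_excel('the_data.xlsx', converters={'Details': parse_details})
# expand the tuples into three proper columns
data[['Type', 'Vector', 'Mission']] = pd.DataFrame(
    data.pop('Details').tolist(), index=data.index)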

Efficient way to unnest pandas dataframe

I'm accessing a fairly large series of json files and storing them in a pandas series, part of a larger dataframe. There are several fields I want from each json, some of which are nested. I've been extracting them using json_normalize. The goal is to then merge these new fields with my original dataframe.
My problem is that when I do so, instead of getting a dataframe with J rows and K columns, I get a J-length series in which each element is a 1xK dataframe. I'm wondering if there is an efficient, vectorized way to turn this nested series of dataframes into a regular dataframe, or to get a regular dataframe from the start.
I've used map/lambda to create my nested series. Right now I'm unnesting with iteritems/append, but there has to be a more efficient way.
url_base = 'http://foo.bar='
df['http'] = df['id'].map(lambda x: url_base + x)
df['json'] = df['http'].map(lambda x: nf.get_json(x))
nest_ser = df['json'].map(lambda x: json_normalize(x))
df = pd.DataFrame()
for index, item in nest_ser.iteritems():
    df = df.append(item)
json_normalize produces:
pd.Series([pd.DataFrame([col1, col2, ...]),
           pd.DataFrame([col1, col2, ...]),
           pd.DataFrame([col1, col2, ...])])
instead of
pd.DataFrame([col1, col2, ...])
Supposing the name of the output series from json_normalize is sr:
pd.concat(sr.tolist())
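A minimal, self-contained demonstration of the idea, with made-up 1-row frames standing in for the json_normalize output:

import pandas as pd

sr = pd.Series([pd.DataFrame({'col1': [1], 'col2': [2]}),
                pd.DataFrame({'col1': [3], 'col2': [4]})])
flat = pd.concat(sr.tolist(), ignore_index=True)
print(flat)  # two rows, columns col1 and col2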

Using pd.DataFrame.replace with an apply function as the replace value

I have several dataframes that have, mixed into some columns, dates in this ASP.NET format "/Date(1239018869048)/". I've figured out how to parse this into Python's datetime format for a given column. However, I would like to put this logic into a function so that I can pass it any dataframe and have it replace all the dates it finds that match a regex, using pd.DataFrame.replace.
something like:
def pretty_dates(df):
    ...  # messy parsing logic here

df.replace(to_replace=r'/Date\((\d+)\)/', value=pretty_dates(df), regex=True)
The problem with this is that the df being passed to pretty_dates is the whole dataframe, not just the cell that needs to be replaced.
So the concept I'm trying to figure out is whether the replacement value used with df.replace can be a function instead of a static value.
Thank you so much in advance
EDIT
To try to add some clarity: I have many columns in the dataframe, over a hundred of which contain this date format. I would prefer not to list out every single column that has a date. Is there a way to apply the date-cleaning function across all the columns in my dataset? So I do not want to clean one column but all the hundreds of columns of my dataframe.
I'm sure you can use regex to do this in one step, but here is how to apply it to the whole column at once:
df = pd.Series(['/Date(1239018869048)/',
                '/Date(1239018869048)/'], dtype=str)
df = df.str.replace(r'/Date\(', '', regex=True)
df = df.str.replace(r'\)/', '', regex=True)
print(df)
0    1239018869048
1    1239018869048
dtype: object
As far as I understand, you need to apply a custom function to selected cells in a specified column. I hope the following example helps you:
import pandas as pd
df = pd.DataFrame({'x': ['one', 'two', 'three']})
selection = df.x.str.contains('t', regex=True) # put your regexp here
df.loc[selection, 'x'] = df.loc[selection, 'x'].map(lambda x: x+x) # do some logic instead
You can apply this procedure to all columns of the df in a loop:
for col in df.columns:
    selection = df.loc[:, col].str.contains('t', regex=True)  # put your regexp here
    df.loc[selection, col] = df.loc[selection, col].map(lambda x: x + x)  # do some logic instead
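Putting the pieces together for the date case, a hedged sketch (pretty_date and the regex are illustrative, not a tested drop-in): it converts every matching cell in every column to a pandas timestamp in one pass.

import re
import pandas as pd

DATE_RE = re.compile(r'/Date\((\d+)\)/')

def pretty_date(cell):
    # only touch string cells that match the ASP.NET date pattern
    if isinstance(cell, str):
        m = DATE_RE.fullmatch(cell)
        if m:
            # the captured group is milliseconds since the Unix epoch
            return pd.to_datetime(int(m.group(1)), unit='ms')
    return cell

# applymap applies the function cell-wise (use DataFrame.map on pandas >= 2.1)
df = df.applymap(pretty_date)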

Dropping an item from a DataFrame vector column

I have a DataFrame with a single column 'value'. I want to split it by space, remove the first item from the split, and recombine the remaining items into a vector column.
It's very easy to do with a UDF or by converting to and from RDD, but I want to use only DataFrame API for performance and code simplicity reasons.
The best I could do was this:
import pyspark.sql.functions as F
from pyspark.ml.feature import VectorAssembler
df = sqlContext.createDataFrame([['10 11 12']], ['value'])
df_split = df.select(F.split('value', ' ').alias('split'))
n = df_split.select(F.size(df_split['split'])).collect()[0][0]
df_columns = df_split.select([F.col('split')[i].astype('int').alias(str(i)) for i in range(1, n)])
v = VectorAssembler(inputCols=[str(i) for i in range(1, n)], outputCol='result')
df_result = v.transform(df_columns).select('result')
It works, but requires an extra action (to get the size of the column after the split) and a lot of code for such a simple task. Is there a simpler way of doing this?
In addition, VectorAssembler won't work for non-numeric types.
Spark 2.0.0, Python 3.5.
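For reference, on Spark 2.4+ (newer than the 2.0.0 this question targets), the split-and-drop step collapses to built-in SQL functions, avoiding the extra collect; a hedged sketch:

import pyspark.sql.functions as F

# slice(arr, 2, size(arr) - 1) keeps everything after the first element
df_rest = df.select(
    F.expr("slice(split(value, ' '), 2, size(split(value, ' ')) - 1)").alias('rest'))

VectorAssembler would still be needed (and would still be numeric-only) to turn the resulting array column into a vector column.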
