Parallelise string replace in pandas - python

I have two pandas dataframes. One contains text, the other a set of terms I'd like to search for and replace within the text. There are many permutations of the text where the same word can appear multiple times and match multiple terms.
I have created a loop which is able to replace each word in the text with a term; however, it's very slow, especially given that it is working over a large corpus.
My question is:
Is there a way of running the below function in a parallelised manner for speed? Alternatively, could the function use Numba or some other type of optimisation to speed it up? NB: note that there can be many permutations within the text that need replacing.
Example text dataframe:
d = {'ID': [1, 2, 3], 'Text': ['here is some random text', 'random text here', 'more random text']}
text_df = pd.DataFrame(data=d)
Example terms dataframe:
d = {'Replace_item': ['<RANDOM_REPLACED>', '<HERE_REPLACED>', '<SOME_REPLACED>'], 'Text': ['random', 'here', 'some']}
replace_terms_df = pd.DataFrame(data=d)
Example of current solution:
def find_replace(text, terms):
    for _, row in terms.iterrows():
        term = row['Text']
        item = row['Replace_item']
        text.Text = text.Text.str.replace(term, item)
    return text

find_replace(text_df, replace_terms_df)
Please let me know if anything above requires clarifying. Thank you,

You can use the vectorized method Series.replace(lst1, lst2, regex=True):
In [90]: (text_df.Text
             .replace(replace_terms_df.Text.tolist(),
                      replace_terms_df.Replace_item.tolist(),
                      regex=True))
Out[90]:
0    <HERE_REPLACED> is <SOME_REPLACED> <RANDOM_REP...
1              <RANDOM_REPLACED> text <HERE_REPLACED>
2                         more <RANDOM_REPLACED> text
Name: Text, dtype: object
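Since the question also asks about parallelisation: if the vectorized replace is still too slow on a very large corpus, one option is to split the Series into chunks and run the same call in worker processes. This is only a minimal sketch, assuming a fork-based multiprocessing start method (so the workers can see replace_terms_df) and an arbitrary chunk/pool size of 4:
import numpy as np
import pandas as pd
from multiprocessing import Pool

def replace_chunk(chunk):
    # each worker runs the same vectorized replace on its own slice of the Series
    return chunk.replace(replace_terms_df.Text.tolist(),
                         replace_terms_df.Replace_item.tolist(),
                         regex=True)

if __name__ == '__main__':
    chunks = np.array_split(text_df.Text, 4)   # 4 roughly equal pieces
    with Pool(4) as pool:
        text_df['Text'] = pd.concat(pool.map(replace_chunk, chunks))
Whether this pays off depends on the corpus size; for small frames the process overhead will outweigh the gain.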

Sentence comparison: how to highlight differences

I have the following sequences of strings within a column in pandas:
SEQ
An empty world
So the word is
So word is
No word is
I can check the similarity using fuzzywuzzy or cosine distance.
However, I would like to know how to get information about the words which change from one row to another.
For example:
Similarity between the first row and the second one is 0, but there is similarity between rows 2 and 3. They contain almost the same words in the same positions. I would like to visualize this change (the missing word) if possible, and similarly for the 3rd row and the 4th.
How can I see the changes between two rows/texts?
Assuming you're using jupyter / ipython and you are just interested in comparisons between a row and the one preceding it, I would do something like this.
The general concept is:
find shared tokens between the two strings (by splitting on ' ' and finding the intersection of two sets).
apply some html formatting to the tokens shared between the two strings.
apply this to all rows.
output the resulting dataframe as html and render it in ipython.
import pandas as pd

data = ['An empty world',
        'So the word is',
        'So word is',
        'No word is']
df = pd.DataFrame(data, columns=['phrase'])

bold = lambda x: f'<b>{x}</b>'

def highlight_shared(string1, string2, format_func):
    shared_toks = set(string1.split(' ')) & set(string2.split(' '))
    return ' '.join([format_func(tok) if tok in shared_toks else tok
                     for tok in string1.split(' ')])

highlight_shared('the cat sat on the mat', 'the cat is fat', bold)

df['previous_phrase'] = df.phrase.shift(1, fill_value='')
df['tokens_shared_with_previous'] = df.apply(
    lambda x: highlight_shared(x.phrase, x.previous_phrase, bold), axis=1)

from IPython.core.display import HTML
HTML(df.loc[:, ['phrase', 'tokens_shared_with_previous']].to_html(escape=False))
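Since the question is about visualizing the change (the missing word) rather than the overlap, here is a small variant of the same idea; it is just a sketch that flips the condition so the differing tokens are the ones that get formatted:
def highlight_diff(string1, string2, format_func):
    # format the tokens of string1 that do NOT appear in string2
    shared_toks = set(string1.split(' ')) & set(string2.split(' '))
    return ' '.join([tok if tok in shared_toks else format_func(tok)
                     for tok in string1.split(' ')])
You can drop it into the same df.apply call in place of highlight_shared.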

check if token (on pandas column) is in an external list of trigrams

I have a dataframe with a column "token" that holds single words extracted from several texts, for example:
text = "hello it's me"
df['token']
0 hello
1 it
2 '
3 s
4 me
The dataframe is very long because I have 1000 sentences, all split into tokens as shown above.
Now I have a list of trigrams, like ['no way out', 'my life is', 'hello my name']
I want to check whether those sentences start with one of the trigrams in the list. How can I do that?
It seems to me that a much better data structure for the tokens is a set. By defining a set from df.token, you can reduce the lookup complexity to O(1), and since you only need to iterate over the list of sentences, this leaves us with an O(len(l)) approach:
tokens = set(df.token.values.tolist())
l = ['no way out', 'my life is', 'hello my name']
[i.split(maxsplit=1)[0] in tokens for i in l]
# [False, False, True]
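If you also want to know which trigrams matched, not just a list of booleans, a small follow-up using the same tokens set and list l could look like this (a sketch, not part of the original answer):
matched = [trigram for trigram in l if trigram.split(maxsplit=1)[0] in tokens]
# ['hello my name']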

Flattening 3D list of words to 2D

I have a pandas column with text strings. For simplicity, let's assume I have a column with two strings.
s=["How are you. Don't wait for me", "this is all fine"]
I want to get something like this:
[["How", "are","you"],["Don't", "wait", "for", "me"],["this","is","all","fine"]]
Basically, take each sentence of a document and tokenize it into a list of words, so finally I need a list of lists of strings.
I tried using a map like below:
nlp = spacy.load('en')

def text_to_words(x):
    """This function converts sentences in a text to a list of words."""
    global log_txt
    x = re.sub(r"\s\s+", " ", x.strip())
    txt_to_words = [str(doc).replace(".", "").split(" ") for doc in nlp(x).sents]
    #log_txt = log_txt.extend(txt_to_words)
    return txt_to_words
The nlp object from spaCy is used to split a string of text into a list of sentences.
log_txt=list(map(text_to_words,s))
log_txt
But this, as you know, puts all of the results from each document into a further nested list:
[[['How', 'are', 'you'], ["Don't", 'wait', 'for', 'me']],
[['this', 'is', 'all', 'fine']]]
You'll need a nested list comprehension. Additionally, you can get rid of punctuation using re.sub.
import re
data = ["How are you. Don't wait for me", "this is all fine"]
words = [
    re.sub(r"[^a-z\s]", '', j.text.lower()).split()
    for i in data
    for j in nlp(i).sents
]
Or,
words = []
for i in data:
    for j in nlp(i).sents:
        words.append(re.sub(r"[^a-z\s]", '', j.text.lower()).split())
There is a much simpler way using a list comprehension.
You can first join the strings with a period '.' and split them again.
[x.split() for x in '.'.join(s).split('.')]
It will give the desired result.
[["How", "are","you"],["Don't", "wait", "for", "me"],["this","is","all","fine"]]
For pandas dataframes, you may get an object in return, and hence a list of lists after the tolist function. Just extract the first element.
For example,
import pandas as pd

def splitwords(s):
    s1 = [x.split() for x in '.'.join(s).split('.')]
    return s1

df = pd.DataFrame(s)
result = df.apply(splitwords).tolist()[0]
Again, it will give you the preferred result.
Hope it helps ;)

Why isn't there function to count Document Frequency (DF) in NLTK?

I am looking for a function to get the DF of a certain term (meaning how many documents in a corpus contain a certain word), but I can't seem to find it here. The page only has functions to get the values of tf, idf, and tf_idf. I am looking specifically for DF only. I copied the code below from the documentation,
matches = len([True for text in self._texts if term in text])
but I don't like the result it gives. For example, if I have a list of strings and I am looking for the word Pete, it also includes the name Peter, which is not what I want. For example:
texts = [['the', 'boy', 'peter'],['pete','the', 'boy'],['peter','rabbit']]
So I am looking for pete which appears TWICE, but the code I showed above will tell you that there are THREE pete's because it also counts peter. How do I solve this? Thanks.
Your description is incorrect. The expression you posted does indeed give 1, not 3, when you search for pete in texts:
>>> texts = [['the', 'boy', 'peter'],['pete','the', 'boy'],['peter','rabbit']]
>>> len([True for text in texts if 'pete' in text])
1
The only way you could have matched partial words is if your texts were not tokenized (i.e. if texts is a list of strings, not a list of token lists).
But the above code is terrible; it builds a list for no reason at all. A better (and more conventional) way to count hits is this:
>>> sum(1 for text in texts if 'pete' in text)
1
As for the question that you pose (Why (...)?): I don't know.
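For what it's worth, a small document-frequency helper along the same lines, assuming as above that texts is a list of token lists:
def doc_freq(texts, term):
    # number of documents (token lists) that contain `term` as a whole token
    return sum(1 for text in texts if term in text)

doc_freq(texts, 'pete')   # 1
doc_freq(texts, 'peter')  # 2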
As a solution to your example (noting that peter occurs twice and pete just once):
texts = [['the', 'boy', 'peter'], ['pete', 'the', 'boy'], ['peter', 'rabbit']]

def flatten(l):
    out = []
    for item in l:
        if isinstance(item, (list, tuple)):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out

flat = flatten(texts)
len([c for c in flat if c in ['pete']])
len([c for c in flat if c in ['peter']])
Compare the two results
Edit:
import collections

def counts(listr, word):
    total = []
    for i in range(len(listr)):
        total.append(word in collections.Counter(listr[i]))
    return sum(total)

counts(texts, 'peter')
# 2

Create list from list with function in pandas dataframe

I would like to create a new pandas column by running a word stemming function over a list of words in another column. I can tokenize a single string by using apply and lambda, but I cannot figure out how to extrapolate this to the case of running it over a list of words.
test = {'Statement' : ['congratulations on the future','call the mechanic','more text'], 'Other' : [2,3,4]}
df = pd.DataFrame(test)
df['tokenized'] = df.apply(lambda row: nltk.word_tokenize(row['Statement']), axis=1)
I know I could solve it with a nested for loop, but that seems inefficient and results in a SettingWithCopyWarning:
df['stems'] = ''
for x in range(len(df)):
    print(len(df['tokenized'][x]))
    df['stems'][x] = row_stems = []
    for y in range(len(df['tokenized'][x])):
        print(df['tokenized'][x][y])
        row_stems.append(stemmer.stem(df['tokenized'][x][y]))
Isn't there a better way to do this?
EDIT:
Here's an example of what the result should look like:
Other Statement tokenized stems
0 2 congratulations on the future [congratulations, on, the, future] [congratul, on, the, futur]
1 3 call the mechanic [call, the, mechanic] [call, the, mechan]
2 4 more text [more, text] [more, text]
No need to run a loop, indeed. At least not an explicit loop. A list comprehension will work just fine.
Assuming you use the Porter stemmer ps:
df['stems'] = df['tokenized'].apply(lambda words: [ps.stem(word) for word in words])
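For completeness, an end-to-end sketch of the same approach; it assumes NLTK's PorterStemmer for ps and otherwise reuses the data from the question:
import nltk
import pandas as pd
from nltk.stem import PorterStemmer

ps = PorterStemmer()
test = {'Statement': ['congratulations on the future', 'call the mechanic', 'more text'],
        'Other': [2, 3, 4]}
df = pd.DataFrame(test)
df['tokenized'] = df['Statement'].apply(nltk.word_tokenize)   # one list of tokens per row
df['stems'] = df['tokenized'].apply(lambda words: [ps.stem(word) for word in words])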
