I have the following sequences of strings within a column in pandas:
SEQ
An empty world
So the word is
So word is
No word is
I can check the similarity using fuzzywuzzy or cosine distance.
However, I would like to know how to get information about the word that changes position from one row to another.
For example:
Similarity between the first row and the second one is 0, but there is similarity between rows 2 and 3: they contain almost the same words in the same positions. I would like to visualize this change (the missing word) if possible, and similarly for the 3rd and 4th rows.
How can I see the changes between two rows/texts?
Assuming you're using Jupyter/IPython and you are just interested in comparing each row with the one preceding it, I would do something like this.
The general concept is:
find shared tokens between the two strings (by splitting on ' ' and finding the intersection of two sets).
apply some html formatting to the tokens shared between the two strings.
apply this to all rows.
output the resulting dataframe as html and render it in ipython.
import pandas as pd
data = ['An empty world',
        'So the word is',
        'So word is',
        'No word is']
df = pd.DataFrame(data, columns=['phrase'])
bold = lambda x: f'<b>{x}</b>'
def highlight_shared(string1, string2, format_func):
    shared_toks = set(string1.split(' ')) & set(string2.split(' '))
    return ' '.join([format_func(tok) if tok in shared_toks else tok for tok in string1.split(' ')])
highlight_shared('the cat sat on the mat', 'the cat is fat', bold)
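# -> '<b>the</b> <b>cat</b> sat on <b>the</b> mat' (the shared tokens 'the' and 'cat' are bolded)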
df['previous_phrase'] = df.phrase.shift(1, fill_value='')
df['tokens_shared_with_previous'] = df.apply(lambda x: highlight_shared(x.phrase, x.previous_phrase, bold), axis=1)
from IPython.core.display import HTML
HTML(df.loc[:, ['phrase', 'tokens_shared_with_previous']].to_html(escape=False))
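The question also asks to visualize the missing word itself. Below is a minimal variant of the same idea, not part of the original answer: highlight_missing, the strike formatter and the missing_from_previous column are illustrative names. It marks tokens of the previous phrase that no longer appear in the current one.

# Sketch: strike through tokens of the previous phrase that are missing from the current one.
strike = lambda x: f'<s>{x}</s>'

def highlight_missing(current, previous, format_func=strike):
    current_toks = set(current.split(' '))
    return ' '.join(format_func(tok) if tok not in current_toks else tok
                    for tok in previous.split(' '))

df['missing_from_previous'] = df.apply(
    lambda x: highlight_missing(x.phrase, x.previous_phrase), axis=1)
HTML(df.loc[:, ['phrase', 'missing_from_previous']].to_html(escape=False))

For the third row ('So word is') this renders the previous phrase as 'So <s>the</s> word is', which makes the dropped word visible.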
I have a DataFrame as below and I wish to detect repeated words, whether they appear split or unsplit:
Table A:
Cat Comments
Stat A power down due to electric shock
Stat A powerdown because short circuit
Stat A top 10 on re work
Stat A top10 on rework
I wish to get the output as below:
Repeated words = ['Powerdown', 'top10', 'on', 'rework']
Anyone have ideas?
I assume that having the words in a dataframe column is not really relevant for the problem at hand. I will therefore transfer them into a list, and then search for repeats.
import pandas as pd
df = pd.DataFrame({"Comments": ["power down due to electric shock", "powerdown because short circuit", "top 10 on re work", "top10 on rework"]})
words = df['Comments'].to_list()
This leads to
['power down due to electric shock',
'powerdown because short circuit',
'top 10 on re work',
'top10 on rework']
Now we create a new list to account for the fact that "top 10" and "top10" should be treated equal:
newa = []
for s in words:
    a = s.split()
    for i in range(len(a) - 1):
        w = a[i] + a[i+1]
        a.append(w)
    newa.append(a)
which yields:
[['power',
'down',
'due',
'to',
'electric',
'shock',
'powerdown',
'downdue',
'dueto',
'toelectric',
'electricshock'],...
Finally we flatten the list and use Counter to find words which occur more than once:
from collections import Counter
from itertools import chain
wordList = list(chain(*newa))
wordCount = Counter(wordList)
[w for w,c in wordCount.most_common() if c>1]
leading to
['powerdown', 'on', 'top10', 'rework']
Let's try:
words = df['Comments'].str.split(' ').explode()
biwords = words + words.groupby(level=0).shift(-1)
(pd.concat([words.groupby(level=0).apply(pd.Series.drop_duplicates),    # remove duplicate words within a comment
            biwords.groupby(level=0).apply(pd.Series.drop_duplicates)]) # remove duplicate bi-words within a comment
   .dropna()                                  # remove NaN created by shifting
   .to_frame().join(df[['Cat']])              # join with the original Cat
   .loc[lambda x: x.duplicated(keep=False)]   # select the duplicated `Comments` within `Cat`
   .groupby('Cat')['Comments'].unique()       # select the unique values within each `Cat`
)
Output:
Cat
Stat A [powerdown, on, top10, rework]
Name: Comments, dtype: object
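(The biwords series works because words holds strings after explode, so adding the group-wise shifted series concatenates each token with the one that follows it in the same comment; 'power' + 'down' becomes 'powerdown' and 're' + 'work' becomes 'rework', which is what lets the split and unsplit spellings match.)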
I need some help please.
I have a dataframe with multiple columns where 2 are:
Content_Clean: a column filled with content (string)
Removals: a list of strings to be removed from the Content_Clean column
Problem: I am trying to replace words in Content_Clean with spaces if they appear in the Removals column:
Example:
Content Clean: 'Johnny and Mary went to the store'
Removals: ['Johnny','Mary']
Output: 'and went to the store'
Example Code:
for i in data_eng['Removals']:
    for u in i:
        data_eng['Content_Clean_II'] = data_eng['Content_Clean'].str.replace(u, ' ')
This does not work because the Removals column contains a list in each row.
Another Example:
data_eng['Content_Clean_II'] = data_eng['Content_Clean'].apply(lambda x: re.sub(data_eng.loc[data_eng['Content_Clean'] == x, 'Removals'].values[0], '', x))
Does not work as this code is only looking for one string.
The problem is that the Removals column holds a list per row, and I want to use that list to remove words from, or replace them with spaces in, the Content_Clean column on a per-row basis.
Here you go. This worked on my test data. Let me know if it works for you.
def repl(row):
    for word in row['Removals']:
        row['Content_Clean'] = row['Content_Clean'].replace(word, '')
    return row
data_eng = data_eng.apply(repl, axis=1)
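If you want to keep the original column and replace with spaces, as described in the question, a small variant of the same approach (repl_with_spaces and Content_Clean_II just mirror the question's naming; this is a sketch, not part of the answer above):

def repl_with_spaces(row):
    # write the cleaned text to a new column instead of overwriting Content_Clean
    cleaned = row['Content_Clean']
    for word in row['Removals']:
        cleaned = cleaned.replace(word, ' ')
    row['Content_Clean_II'] = cleaned
    return row

data_eng = data_eng.apply(repl_with_spaces, axis=1)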
You can call the str.replace(old, new) method to remove unwanted words from a string.
Here is one small example I have done.
a_string = "I do not like to eat apples and watermelons"
stripped_string = a_string.replace(" do not", "")
print(stripped_string)
This will remove "do not" from the sentence.
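To strip several words this way you can loop over a list of terms; a minimal sketch using the example from the question:

removals = ["Johnny", "Mary"]  # example terms from the question
text = "Johnny and Mary went to the store"
for word in removals:
    text = text.replace(word, " ")
print(" ".join(text.split()))  # prints: and went to the store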
Problem:
Using scikit-learn to find the number of hits of variable n-grams of a particular vocabulary.
Explanation:
I got examples from here.
Imagine I have a corpus and I want to find how many hits (counting) has a vocabulary like the following one:
myvocabulary = [(window=4, words=['tin', 'tan']),
                (window=3, words=['electrical', 'car']),
                (window=3, words=['elephant', 'banana'])]
What I call window here is the length of the span of words within which the words can appear, as follows:
'tin tan' is hit (within 4 words)
'tin dog tan' is hit (within 4 words)
'tin dog cat tan' is hit (within 4 words)
'tin car sun eclipse tan' is NOT hit. tin and tan appear more than 4 words away from each other.
I just want to count how many times (window=4, words=['tin', 'tan']) appears in a text, and the same for all the other entries, and then add the result to a pandas DataFrame in order to calculate tf-idf.
I could only find something like this:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())
where vocabulary is a simple list of strings, being single words or several words.
Besides, from scikit-learn:
class sklearn.feature_extraction.text.CountVectorizer
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
does not help either.
Any ideas?
I am not sure if this can be done using CountVectorizer or TfidfVectorizer. I have written my own function for doing this as follows:
import pandas as pd
import numpy as np
import string
def contained_within_window(token, word1, word2, threshold):
    word1 = word1.lower()
    word2 = word2.lower()
    token = token.translate(str.maketrans('', '', string.punctuation)).lower()
    if word1 in token and word2 in token:
        word_list = token.split(" ")
        word1_index = [i for i, x in enumerate(word_list) if x == word1]
        word2_index = [i for i, x in enumerate(word_list) if x == word2]
        count = 0
        for i in word1_index:
            for j in word2_index:
                if np.abs(i - j) <= threshold:
                    count += 1
        return count
    return 0
SAMPLE:
corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question'
]
df = pd.DataFrame(corpus, columns=["Test"])
your df will look like this:
Test
0 This is the first document. And this is what I...
1 This document is the second document.
2 And this is the third one.
3 Is this the first document?
4 I like coding in sklearn
5 This is a very good question
Now you can apply contained_within_window as follows:
sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))
And you get:
2
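(The two hits come from the first two sentences: after punctuation is stripped, the second 'this' in the first sentence sits two positions from 'document', and in 'This document is the second document.' the first 'this' and 'document' are adjacent.)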
You can just run a for loop to check different (window, words) combinations, as sketched below.
And you can use this to construct your pandas DataFrame and apply tf-idf on it, which is straightforward.
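A minimal sketch of that loop, assuming the vocabulary from the question is stored as plain (window, [word1, word2]) pairs (myvocabulary and counts are illustrative names):

# hypothetical plain-tuple version of the question's vocabulary
myvocabulary = [(4, ['tin', 'tan']),
                (3, ['electrical', 'car']),
                (3, ['elephant', 'banana'])]

counts = {}
for window, (w1, w2) in myvocabulary:
    # total windowed co-occurrences of the pair across all rows of df.Test
    counts[(w1, w2)] = sum(df.Test.apply(
        lambda x: contained_within_window(x, word1=w1, word2=w2, threshold=window)))
print(counts)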
I have two pandas dataframes. One contains text, the other a set of terms I'd like to search for and replace within the text. There are many permutations of the text where the same word can appear multiple times in the text and have multiple terms.
I have created a loop which is able to replace each word in the text with a term however it's very slow, especially given that it is working over a large corpus.
My question is:
Is there a way of running the below function in parallel for speed? Alternatively, could the function use Numba or some other type of optimisation to speed it up? NB: note that there can be many permutations within the text that need replacing.
Example text dataframe:
d = {'ID': [1, 2, 3], 'Text': ['here is some random text', 'random text here', 'more random text']}
text_df = pd.DataFrame(data=d)
Example terms dataframe:
d = {'Replace_item': ['<RANDOM_REPLACED>', '<HERE_REPLACED>', '<SOME_REPLACED>'], 'Text': ['random', 'here', 'some']}
replace_terms_df = pd.DataFrame(data=d)
Example of current solution:
def find_replace(text, terms):
    for _, row in terms.iterrows():
        term = row['Text']
        item = row['Replace_item']
        text.Text = text.Text.str.replace(term, item)
    return text
find_replace(text_df, replace_terms_df)
Please let me know if anything above requires clarifying. Thank you.
You can use the vectorized method Series.replace(lst1, lst2, regex=True):
In [90]: (text_df.Text
.replace(replace_terms_df.Text.tolist(),
replace_terms_df.Replace_item.tolist(),
regex=True))
Out[90]:
0 <HERE_REPLACED> is <SOME_REPLACED> <RANDOM_REP...
1 <RANDOM_REPLACED> text <HERE_REPLACED>
2 more <RANDOM_REPLACED> text
Name: Text, dtype: object
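One caveat (my note, not part of the answer above): with regex=True the search terms are interpreted as regular expressions, so if they may contain regex metacharacters you can escape them first:

import re

patterns = [re.escape(t) for t in replace_terms_df.Text.tolist()]
text_df['Text'] = text_df.Text.replace(patterns,
                                       replace_terms_df.Replace_item.tolist(),
                                       regex=True)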
top_N = 100
words = review_tip['user_tip'].dropna()
words = words.astype(str)
words = words.str.replace('[{}]'.format(string.punctuation), '')
words = words.str.lower().apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)]))
# replace '|'-->' ' and drop all stopwords
words = words.str.lower().replace([r'\|', RE_stopwords], [' ', ''], regex=True).str.cat(sep=' ').split()
# generate DF out of Counter
rslt = pd.DataFrame(Counter(words).most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')
print(rslt)
plt.clf()
# plot
rslt.plot.bar(rot=90, figsize=(16,10), width=0.8)
plt.show()
Frequency
Word
great 17069
food 16381
good 12502
service 11342
place 10841
best 9280
get 7483
love 7042
amazing 5043
try 4945
time 4810
go 4594
dont 4377
As you can see the results are single words, which is something I can use, but is it possible to get pairs of words that could have been used together a lot?
For example, getting
dont go (this could appear 100 times)
instead of getting them separately:
dont 100
go 100
This will generate bi-grams; is this what you are looking for?
bi_grams = zip(words, words[1:])
It generates tuples, which is fine to use in the Counter, but you could also easily tweak the code to use ' '.join((a, b)).
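For completeness, a small sketch of feeding those bi-grams to Counter, assuming words is the flat token list built above:

from collections import Counter

# join each adjacent pair of tokens and count the most frequent combinations
bi_gram_counts = Counter(' '.join(pair) for pair in zip(words, words[1:]))
print(bi_gram_counts.most_common(10))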