Python: How to match the words in split and non-split forms?

I have a DataFrame as below, and I wish to detect repeated words whether they appear in split or non-split form:
Table A:
Cat     Comments
Stat A  power down due to electric shock
Stat A  powerdown because short circuit
Stat A  top 10 on re work
Stat A  top10 on rework
I wish to get the output as below:
Repeated words = ['Powerdown', 'top10', 'on', 'rework']
Anyone have ideas?

I assume that having the words in a dataframe column is not really relevant for the problem at hand. I will therefore transfer them into a list, and then search for repeats.
import pandas as pd
df = pd.DataFrame({"Comments": ["power down due to electric shock", "powerdown because short circuit", "top 10 on re work", "top10 on rework"]})
words = df['Comments'].to_list()
This leads to
['power down due to electric shock',
 'powerdown because short circuit',
 'top 10 on re work',
 'top10 on rework']
Now we create a new list to account for the fact that "top 10" and "top10" should be treated as equal:
newa = []
for s in words:
    a = s.split()
    for i in range(len(a) - 1):
        w = a[i] + a[i+1]
        a.append(w)
    newa.append(a)
which yields:
[['power',
  'down',
  'due',
  'to',
  'electric',
  'shock',
  'powerdown',
  'downdue',
  'dueto',
  'toelectric',
  'electricshock'], ...
Finally we flatten the list and use Counter to find words which occur more than once:
from collections import Counter
from itertools import chain
wordList = list(chain(*newa))
wordCount = Counter(wordList)
[w for w, c in wordCount.most_common() if c > 1]
leading to
['powerdown', 'on', 'top10', 'rework']

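Note that the next answer joins on the Cat column, so it assumes df was built from the question's full table rather than the Comments-only frame above; here is a minimal reconstruction of that input (an assumption, following the table in the question):
import pandas as pd
df = pd.DataFrame({
    "Cat": ["Stat A"] * 4,
    "Comments": ["power down due to electric shock",
                 "powerdown because short circuit",
                 "top 10 on re work",
                 "top10 on rework"],
})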
Let's try:
words = df['Comments'].str.split(' ').explode()
biwords = words + words.groupby(level=0).shift(-1)
(pd.concat([words.groupby(level=0).apply(pd.Series.drop_duplicates),    # remove duplicate words within a comment
            biwords.groupby(level=0).apply(pd.Series.drop_duplicates)]) # remove duplicate bi-words within a comment
   .dropna()                                  # remove the NaN created by shifting
   .to_frame().join(df[['Cat']])              # join with the original Cat
   .loc[lambda x: x.duplicated(keep=False)]   # select the duplicated Comments within Cat
   .groupby('Cat')['Comments'].unique()       # select the unique values within each Cat
)
Output:
Cat
Stat A [powerdown, on, top10, rework]
Name: Comments, dtype: object

Related

Python: use list of initial characters to retrieve full word from other list?

I'm trying to use the list of shortened words to select & retrieve the corresponding full word identified by its initial sequence of characters:
shortwords = ['appe', 'kid', 'deve', 'colo', 'armo']
fullwords = ['appearance', 'armour', 'colored', 'developing', 'disagreement', 'kid', 'pony', 'treasure']
Trying this regex match with a single shortened word:
import re
shortword = 'deve'
retrieved = filter(lambda i: re.match(r'{}'.format(shortword), i), fullwords)
print(*retrieved)
returns
developing
So the regex match works but the question is how to adapt the code to iterate through the shortwords list and retrieve the full words?
EDIT: The solution needs to preserve the order from the shortwords list.
Maybe using a dictionary:
test = 'appe is a deve arm'
shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']

# Build the dictionary
d = {}
for i in range(len(shortwords)):
    d[shortwords[i]] = fullwords[i]

# Apply the dictionary to the test string
res = " ".join(d.get(s, s) for s in test.split())

# Print the test data after dictionary mapping
print(res)
That is one way to do it:
shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']

# Dict comprehension
words = {short: full for short, full in zip(shortwords, fullwords)}

# Solving the problem
keys = ['deve', 'arm', 'pony']
values = [words[key] for key in keys]
print(values)
This is a classic key-value problem. Use a dictionary for it, or consider pandas in the long term.
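For the exact lists in the question, which are not parallel and where the order of shortwords must be preserved, here is a minimal sketch of that dictionary idea. It assumes each full word is uniquely identified by its first four characters (true for the question's example lists):
shortwords = ['appe', 'kid', 'deve', 'colo', 'armo']
fullwords = ['appearance', 'armour', 'colored', 'developing', 'disagreement', 'kid', 'pony', 'treasure']

# Key each full word by its first four characters, then look the
# abbreviations up in their original order.
prefix_map = {full[:4]: full for full in fullwords}
retrieved = [prefix_map[short[:4]] for short in shortwords if short[:4] in prefix_map]
print(retrieved)  # ['appearance', 'kid', 'developing', 'colored', 'armour']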
Your question text seems to indicate that you're looking for your shortwords at the start of each word. That should be easy then:
matched_words = [word for word in fullwords if any(word.startswith(shortword) for shortword in shortwords)]
If you'd like to regex this for some reason (it's unlikely to be faster), you could do that with a large alternation:
regex_alternation = '|'.join(re.escape(shortword) for shortword in shortwords)
matched_words = [word for word in fullwords if re.match(rf"^(?:{regex_alternation})", word)]
Alternatively, if your shortwords are always four characters, you could just slice the first four off:
# Sets have O(1) lookups, so this will save a significant amount of time
# if either shortwords or fullwords is long
shortwords = set(shortwords)
matched_words = [word for word in fullwords if word[:4] in shortwords]
This snippet has the functionality I wanted. It builds a regular expression pattern at each loop iteration in order to accommodate varying word-length parameters, and it maintains the original order of the wordroots list. In essence, it looks at each word in wordroots and fills out the full word from the dataset. This is useful when working with the bip-0039 word list, which contains words of 3-8 characters in length that are uniquely identifiable by their initial 4 characters. Recovery phrases are built by randomly selecting a sequence of words from the bip-0039 list; order is important. Observed security practice is often to abbreviate each word to a maximum of its four initial characters. Here is code which rebuilds a recovery phrase from its abbreviation:
import re

wordroots = ['sun', 'sunk', 'sunn', 'suns']
dataset = ['sun', 'sunk', 'sunny', 'sunshine']

retrieved = []
for root in wordroots:
    # (exact match) or ((match at beginning of word when root is 4 or more characters) else (exact match))
    pattern = r"(^" + root + "$|" + ("^" + root + "[a-zA-Z]+)" if len(root) >= 4 else "^" + root + "$)")
    retrieved.extend(filter(lambda i: re.match(pattern, i), dataset))
print(*retrieved)
Output:
sun sunk sunny sunshine

Sentence comparison: how to highlight differences

I have the following sequences of strings within a column in pandas:
SEQ
An empty world
So the word is
So word is
No word is
I can check the similarity using fuzzywuzzy or cosine distance.
However, I would like to know how to get information about the word which changes from one row to another.
For example:
Similarity between the first row and the second one is 0, but there is similarity between rows 2 and 3:
they contain almost the same words in the same positions. I would like to visualize this change (the missing word) if possible, and similarly for the 3rd and 4th rows.
How can I see the changes between two rows/texts?
Assuming you're using jupyter / ipython and you are just interested in comparisons between a row and the one preceding it, I would do something like this.
The general concept is:
- find the shared tokens between the two strings (by splitting on ' ' and taking the intersection of two sets);
- apply some HTML formatting to the tokens shared between the two strings;
- apply this to all rows;
- output the resulting dataframe as HTML and render it in ipython.
import pandas as pd

data = ['An empty world',
        'So the word is',
        'So word is',
        'No word is']
df = pd.DataFrame(data, columns=['phrase'])

bold = lambda x: f'<b>{x}</b>'

def highlight_shared(string1, string2, format_func):
    shared_toks = set(string1.split(' ')) & set(string2.split(' '))
    return ' '.join([format_func(tok) if tok in shared_toks else tok for tok in string1.split(' ')])

highlight_shared('the cat sat on the mat', 'the cat is fat', bold)

df['previous_phrase'] = df.phrase.shift(1, fill_value='')
df['tokens_shared_with_previous'] = df.apply(lambda x: highlight_shared(x.phrase, x.previous_phrase, bold), axis=1)

from IPython.core.display import HTML
HTML(df.loc[:, ['phrase', 'tokens_shared_with_previous']].to_html(escape=False))

Problem replacing words/phrases in a text file using a set in Python

Suppose I have a list (new_list) of 3000 sentences, where the sentences are separated by commas (,).
Example (a part):
new_list = ['air purity controller, to detect pollution and letting cold air in', 'air quality in my home by air conditioning', 'air conditioner depending on home', 'household alarm clock for time']
I want to replace certain words (single words or phrases) in new_list by adding some special characters (at the start and end). I am doing this with the help of a set.
Example of the set:
dict = {'air conditioner', 'air', 'air quality', 'house', 'air conditioning', 'alarm clock'}
The size of the set (dict) is 317. I want to scan each word of new_list and, when there is a match with the set, replace it by appending special characters at the start and end positions. Further, if the match is a phrase from the set, a special character (_) should additionally be inserted between its words, along with the special characters at the start and end.
I have tried the following, but it is failing. Please suggest where I am going wrong (though I don't think I am). The new_list and dict are shown above.
import re, csv, nltk
from nltk.corpus import stopwords
from nltk import regexp_tokenize

with open("raw_data.txt", 'r', encoding='utf-8') as f1:
    reader = csv.reader(f1, skipinitialspace=True)
    new_list = next(reader)

with open('updatd_file.txt', 'w', encoding='utf-8') as f2:
    dic = {'air conditioner', 'air quality', 'air conditioning', 'air', 'house', 'alarm clock'}
    dic = {i: i.replace(' ', '_') for i in dic}
    pattern = re.compile(r"\b(" + "|".join(dic) + r")\b")
    modify_reqs = [pattern.sub(lambda x: "_{}_".format(dic[x.group()]), i) for i in new_list]
    sw = stopwords.words('english')
    unfiltered_tokens = [[word for word in regexp_tokenize(word, pattern=r"\s|[\d]|[^\wa-z+]", gaps=True) if word not in sw] for word in modify_reqs]
    f2.write(str(unfiltered_tokens))
I am executing this program and writing the results to a file. When I check the output file, I can sometimes see the words in the desired order (missing a few words), but other times I cannot. I am unable to understand or explain this strange behavior.
That is, sometimes I find a phrase in the correct, expected form, e.g. '_air_conditioning_', but the next time I execute the same fragment I find the same words as '_air_', 'conditioning' (separated). The same thing also happens with the other phrases, like air quality and air conditioner. The problem is with the phrases, not with the single words.
Please note that the set (dict) has 317 words and new_list contains almost 3000 sentences, so it is not possible to show everything here.
How is this possible? I have been trying this for 7-8 days and it is getting frustrating.
The comment by @Toto really helped me resolve the issue:
I sorted the set in descending order of the length of the words, using the built-in sorted.
dic = sorted(dic, key=len, reverse=True)
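Why the behavior was intermittent: a Python set iterates in an order that can change from run to run (string hashing is randomized), so sometimes 'air' ended up before 'air conditioning' in the regex alternation, and re takes the first alternative that matches. A minimal sketch of the fix applied to the example data (re.escape is added here for safety; these phrases contain no regex metacharacters, so behavior is unchanged):
import re

dic = {'air conditioner', 'air quality', 'air conditioning', 'air', 'house', 'alarm clock'}
# Sort longest-first so 'air conditioning' is tried before 'air'
ordered = sorted(dic, key=len, reverse=True)
replacements = {i: i.replace(' ', '_') for i in ordered}
pattern = re.compile(r"\b(" + "|".join(map(re.escape, ordered)) + r")\b")

text = 'air quality in my home by air conditioning'
print(pattern.sub(lambda m: "_{}_".format(replacements[m.group()]), text))
# _air_quality_ in my home by _air_conditioning_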

Python replace strings using regex on large dataset

I have recently started using the re package in order to clean up transaction descriptions.
Example of original transaction descriptions:
['bread','payment to facebook.com', 'milk', 'savings', 'amazon.com $xx ased lux', 'holiday_amazon']
For a list of expressions I would like to replace the current description with a better one, e.g. if a list entry contains 'facebook' or 'amazon' preceded by a space (or at the beginning of the string), I want to replace the entire list entry with the word 'facebook' or 'amazon' respectively, i.e.:
['bread', 'facebook', 'milk', 'savings', 'amazon', 'holiday_amazon']
As I only want to pick it up if the word facebook is preceded by a space or is at the beginning of the string, I have created a regex that represents this, e.g. (^|\s)facebook. Note that this is only an example; in reality I want to filter out more complex expressions as well.
In total I have a dataframe with 90 such expressions that I want to replace.
My current code (with a minimal workable example) is:
import pandas as pd
import re
def specialCases(list_of_narratives, replacement_dataframe):
    # Create output array
    new_narratives = []
    special_cases_identifiers = replacement_dataframe["REGEX_TEST"]
    # For each string element of the list
    for memo in list_of_narratives:
        index_count = 0
        found_count = 0
        for i in special_cases_identifiers:
            regex = re.compile(i)
            if re.search(regex, memo.lower()) is not None:
                new_narratives.append(replacement_dataframe["NARRATIVE_TO"].values[index_count].lower())
                index_count += 1
                found_count += 1
                break
            else:
                index_count += 1
        if found_count == 0:
            new_narratives.append(memo.lower())
    return new_narratives
# Minimum example creation
list_of_narratives = ['bread','payment to facebook.com', 'milk', 'savings', 'amazon.com $xx ased lux', 'holiday_amazon']
list_of_regex_expressions = ['(^|\s)facebook', '(^|\s)amazon']
list_of_regex_replacements = ['facebook', 'amazon']
replacement_dataframe = pd.DataFrame({'REGEX_TEST': list_of_regex_expressions, 'NARRATIVE_TO': list_of_regex_replacements})
# run code
new_narratives = specialCases(list_of_narratives, replacement_dataframe)
However, with over 1 million list entries and 90 different regex expressions to replace (i.e. len(list_of_regex_expressions) is 90), this is extremely slow, presumably due to the double for loop.
Could someone help me improve the performance of this code?
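No answer is recorded here, but one clear improvement is visible in the code itself: re.compile runs inside the inner loop, so every pattern is recompiled for each of the million narratives, and memo.lower() is recomputed per pattern as well. Below is a sketch of the same logic with that work hoisted out of the loops (the helper name special_cases_fast is hypothetical); it is still O(entries x patterns) in the worst case, but with far less per-iteration overhead:
import re

def special_cases_fast(narratives, regexes, replacements):
    # Compile each pattern exactly once, outside the main loop
    compiled = [(re.compile(rx), rep.lower()) for rx, rep in zip(regexes, replacements)]
    out = []
    for memo in narratives:
        memo_lower = memo.lower()          # lower-case once per narrative
        for regex, rep in compiled:
            if regex.search(memo_lower):   # first matching pattern wins
                out.append(rep)
                break
        else:                              # no pattern matched
            out.append(memo_lower)
    return out

# e.g. special_cases_fast(list_of_narratives, list_of_regex_expressions, list_of_regex_replacements)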

python word grouping based on words before and after

I am trying to create groups of words. First I count all the words. Then I establish the top 10 words by word count. Then I want to create 10 groups of words based on those top 10. Each group consists of all the words that appear before and after the top word.
I have survey results stored in a python pandas dataframe structured like this
Question_ID | Customer_ID | Answer
          1 |         234 | Data is very important to use because ...
          2 |         234 | We value data since we need it ...
I also saved the answers column as a string.
I am using the following code to find the 3 words before and after a given word (I actually had to create a string out of the answers column):
answers_str = df.Answer.apply(str)
for value in answers_str:
    non_data = re.split('data|Data', value)
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]       # slice and grab first three terms
    result = [' '.join(term) for term in substrs]              # combine the terms back into substrings
    print(result)
I have been manually creating groups of words - but is there a way of doing it in python?
So based on the example shown above, the group with word counts would look like this:
group "data":
data: 2
important: 1
value: 1
need: 1
then when it goes through the whole file, there would be another group:
group "analytics":
analyze: 5
report: 7
list: 10
visualize: 16
The idea would be to get rid of "we", "to", "is" as well - but I can do that manually, if that's not possible.
Then the goal is to establish the 10 most used words (by word count) and create 10 groups with the words that appear in front of and behind each of those top 10 words.
We can use regex for this. We'll be using this regular expression
((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})
which you can test for yourself here, to extract the three words before and after each occurrence of data.
First, let's remove all the words we don't like from the strings.
import re

# If you're processing a lot of sentences, it's probably wise to preprocess
# the pattern, assuming that bad_words is the same for all sentences
def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
Then we want to get the words that surround data in each line:
data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, s, flags=re.IGNORECASE)
gives us a list of tuples of strings. We want to get a list of those strings after they are split.
from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))
That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuple, split each string, and then pull all the resulting words into one big list.
Let's put this all together with your pandas code. pandas isn't my strongest area, so please don't assume that I haven't made some elementary mistake if you see something weird looking.
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)
The nice thing about the regex we're using is that none of the complicated parts care about which word we're searching for. With a slight change, we can turn it into a format string
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
such that
base_pat.format('data') == data_pat
So, given some list key_words of words we want to collect information about:
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

key_words = ['data', 'analytics']
d = {}
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c
Now we have a dictionary d that maps keywords, like data and analytics, to Counters that map words not on our blacklist to their counts in the vicinity of the associated keyword. Something like:
d = {'data': Counter({'important': 2,
                      'very': 3}),
     'analytics': Counter({'boring': 5,
                           'sleep': 3})}
As to how we get the top 10 words, that's basically the thing Counter is best at.
key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
