I'm currently working on a data science project and having trouble with data preparation.
Specifically this one: What's Cooking?
The dataset has strings like 'medium eggs', 'large free range egg', 'eggplants', 'large egg whites', 'chinese egg noodles' and 'eggs'
So in this case, I would like to find and replace all occurrences of 'medium eggs' and 'large free range egg' with just 'eggs', while strings like 'eggplants' and 'chinese egg noodles' should be left alone. I would also need to replace 'large egg whites' with 'egg whites'.
Another case would be 'garbanzo beans' and 'chick peas' since they refer to the same ingredient.
My initial attempt was just to find any string containing 'egg' and replace it, but because there are so many conditions, I'm not sure what kind of approach to take now.
Since this is a classification project, the code needs to be able to take potential ingredients like 'small egg' and still understand them as 'eggs'.
This can be done fairly cleanly with a regex, using word boundaries around the query string so that 'egg' never matches inside 'eggplants':
import re

def replace_eggs(string_to_replace, replacement_text, *query_strings):
    for query_string in query_strings:
        # \b word boundaries keep 'egg' from matching inside 'eggplants';
        # re.escape guards against metacharacters in the query string
        string_to_replace = re.sub(rf"\b{re.escape(query_string)}\b",
                                   replacement_text, string_to_replace)
    return string_to_replace
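For example, a quick check with strings from the question:

print(replace_eggs('medium eggs', 'eggs',
                   'medium eggs', 'large free range egg', 'small egg'))
# eggs
print(replace_eggs('chinese egg noodles', 'eggs',
                   'medium eggs', 'large free range egg', 'small egg'))
# chinese egg noodles  (left alone)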
Caveat: this is still a quick sketch, and you have to list every variant ('small egg', 'medium eggs', ...) yourself. Learning more about regex and capture groups will let you handle the remaining cases more elegantly.
As a partial solution, you could write a simple function using this:
import spacy

nlp = spacy.load('en_core_web_sm')  # any English model should work here

items = ['medium eggs', 'large free range egg', 'eggplants',
         'large egg whites', 'chinese egg noodles', 'eggs']
clean = []
for i in items:
    doc = nlp(i)
    temp = ''
    for token in doc:
        # print(token.text, token.pos_)
        # keep only the nouns and proper nouns
        if token.pos_ == 'NOUN' or token.pos_ == 'PROPN':
            temp += ' ' + token.text
    clean.append(temp)
print(clean)
Output: [' eggs', ' range egg', ' eggplants', ' egg whites', ' egg noodles', ' eggs']
NOTE: You might need to take care of a few cases like 'garbanzo beans' and 'chick peas' manually
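For those, a small hand-written synonym map applied after the spaCy pass is probably the simplest option. A minimal sketch (the mapping and canonical names here are illustrative):

synonyms = {
    'garbanzo beans': 'chickpeas',
    'chick peas': 'chickpeas',
}

def canonicalize(ingredient):
    # fall back to the ingredient itself when no synonym is registered
    return synonyms.get(ingredient.strip(), ingredient.strip())

print(canonicalize('garbanzo beans'))  # chickpeas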
I'm looking for a package (or any approach other than manual replacement) for handling templates within string formatting.
I want to achieve something like this (this is just an example so you could get the idea, not the actual working code):
text = "I {what:like,love} {item:pizza,space,science}".format(what=2,item=3)
print(text)
So the output would be:
I love science
How can I achieve this? I have been searching but cannot find anything appropriate; I'm probably using the wrong search terms.
If there isn't any ready-to-use package around, I would love to read some tips on where to start coding this myself.
I think using a list is sufficient, since Python lists preserve insertion order:
what = ["like","love"]
items = ["pizza","space","science"]
text = "I {} {}".format(what[1],items[2])
print(text)
output:
I love science
Maybe use a list or a tuple for what and item, as both data types preserve insertion order.
what = ['like', 'love']
item = ['pizza', 'space', 'science']
text = "I {what} {item}".format(what=what[1],item=item[2])
print(text) # I like science
Or even this is possible:
text = "I {what[1]} {item[2]}".format(what=what, item=item)
print(text) # I love science
Hope this helps!
Why not use a dictionary?
options = {'what': ('like', 'love'), 'item': ('pizza', 'space', 'science')}
print("I " + options['what'][1] + ' ' + options['item'][2])
This returns: "I love science"
Or, if you want to avoid reformatting to add or remove spaces, incorporate the spacing into your dictionary structure, like so:
options = {'what': (' like', ' love'), 'item': (' pizza', ' space', ' science'), 'fullstop': '.'}
print("I" + options['what'][0] + options['item'][0] + options['fullstop'])
And this returns: "I like pizza."
Since no one has provided an answer that addresses my question directly, I decided to work on this myself.
I had to use double brackets, because single ones are reserved for string formatting.
I ended up with the following class:
import re

class ArgTempl:
    def __init__(self, _str):
        self._str = _str

    def format(self, **args):
        for k in re.finditer(r"{{(\w+):([\w,]+?)}}", self._str,
                             flags=re.DOTALL | re.MULTILINE | re.IGNORECASE):
            key, replacements = k.groups()
            if key not in args:
                continue
            self._str = self._str.replace(k.group(0),
                                          replacements.split(',')[args[key]])
        return self._str
This is primitive code written in five minutes, so it lacks checks and so on, but it works as expected and can easily be improved.
Tested on Python 2.7 & 3.6.
Usage:
test = "I {{what:like,love}} {{item:pizza,space,science}}"
print(ArgTempl(test).format(what=1, item=2))
> I love science
Thanks for all of the replies.
I have a text string and I want to replace two words with a single word. E.g. if the phrase is 'artificial intelligence', I want to replace it with 'artificial_intelligence'. This needs to be done for a list of 200 phrases and on a text file of about 5 MB.
I tried string.replace, but it only works for a single element, not for a whole list.
Example:
text = 'Artificial intelligence is useful for us in every situation of deep learning.'
list_a : list_b
Artificial intelligence : artificial_intelligence
Deep learning : deep_learning
...
text.replace('Artificial intelligence', 'Artificial_intelligence') is working.
But
for i in range(len(list_a)):
    text = text.replace(list_a[i], list_b[i])
doesn't work.
I would suggest using a dict for your replacements:
text = "Artificial intelligence is useful for us in every situation of deep learning."
replacements = {"Artificial intelligence": "Artificial_intelligence",
                "deep learning": "deep_learning"}
Then your approach works (although it is case-sensitive):
>>> for rep in replacements:
...     text = text.replace(rep, replacements[rep])
>>> print(text)
Artificial_intelligence is useful for us in every situation of deep_learning.
For other approaches (like the suggested regex-approach), have a look at SO: Python replace multiple strings.
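For reference, a minimal sketch of such a single-pass regex approach (the helper name is my own; it compiles one case-insensitive alternation over all keys):

import re

def replace_all(text, replacements):
    # re.escape guards against keys containing regex metacharacters
    pattern = re.compile("|".join(re.escape(k) for k in replacements),
                         flags=re.IGNORECASE)
    # look up replacements by lower-cased key, since matching is case-insensitive
    lookup = {k.lower(): v for k, v in replacements.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

print(replace_all(text, replacements))
# Artificial_intelligence is useful for us in every situation of deep_learning.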
Since you have a case problem between your list entries and your string, you could use the re.sub() function with the IGNORECASE flag to obtain what you want:
import re
list_a = ['Artificial intelligence', 'Deep learning']
list_b = ['artificial_intelligence', 'deep_learning']
text = 'Artificial intelligence is useful for us in every situation of deep learning.'
for from_, to in zip(list_a, list_b):
    text = re.sub(from_, to, text, flags=re.IGNORECASE)
print(text)
# artificial_intelligence is useful for us in every situation of deep_learning.
Note the use of the zip() function, which allows iterating over the two lists at the same time.
Also note that Christian is right, a dict would be more suitable for your substitution data. The previous code would then be the following for the exact same result:
import re
subs = {'Artificial intelligence': 'artificial_intelligence',
        'Deep learning': 'deep_learning'}
text = 'Artificial intelligence is useful for us in every situation of deep learning.'
for from_, to in subs.items():
    text = re.sub(from_, to, text, flags=re.IGNORECASE)
print(text)
I am trying to order a number of short paragraphs by their agreement with a list of keywords. This is used to provide a user with the text ordered by interest.
Let's assume I already have the list of keywords, hopefully reflecting the user's interests. I thought this was a fairly standard procedure and expected to find a Python package for it, but so far my Google search has not been very successful.
I can easily come up with a brute-force solution myself, but I was wondering whether somebody knows an efficient way to do this?
EDIT:
Ok here is an example:
keywords = ['cats', 'food', 'Miau']
text1 = 'This is text about dogs'
text2 = 'This is text about food'
text3 = 'This is text about cat food'
I need a procedure which leads to the order text3, text2, text1
thanks
This is the simplest thing I can think of:
import string

with open('document.txt', 'r') as f:
    text = f.read()

# strip punctuation (str.maketrans/translate, Python 3 style)
text = text.translate(str.maketrans('', '', string.punctuation))
wordlist = text.split()

agreement_cnt = 0
for word in list_of_keywords:
    agreement_cnt += wordlist.count(word)
I got the punctuation-removal bit from here: Best way to strip punctuation from a string in Python.
Something like this might be a good starting point:
>>> keywords = ['cats', 'food', 'Miau']
>>> text1 = 'This is a text about food fed to cats'
>>> matched_word_count = len(set(text1.split()).intersection(set(keywords)))
>>> print(matched_word_count)
2
If you want to correct for capitalization or capture word forms (i.e. 'cat' instead of 'cats'), there's obviously more to consider, though.
Taking the above and capturing match counts for a list of different strings, and then sorting the results to find the "best" match, should be relatively simple.
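As a minimal sketch of that ranking, under the caveat above that 'cats' vs. 'cat' still needs stemming or similar to make text3 win outright:

keywords = {'cats', 'food', 'Miau'}
texts = ['This is text about dogs',
         'This is text about food',
         'This is text about cat food']

def match_count(text):
    # number of distinct keywords found among the text's words
    return len(set(text.split()) & keywords)

# order the texts from most to fewest keyword matches
ranked = sorted(texts, key=match_count, reverse=True)
print(ranked)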
I am in need of a little help here. I need to identify negated phrases like "not good" and "not bad" and then determine the polarity (negative or positive) of the sentiment. I have done everything except handling the negations, and I just want to know how I can include negations. How do I go about it?
Negation handling is quite a broad field with numerous potential implementations. Here is sample code that negates a sequence of text and stores the negated uni/bi/trigrams in not_ form. Note that nltk isn't used here, in favor of simple text processing.
def negate_sequence(text):
    """
    Detect negations and transform negated words into 'not_' form.
    Creation of uni/bi/trigrams is handled here as well.
    """
    negation = False
    delims = "?.,!:;"
    result = []
    words = text.split()
    prev = None
    pprev = None
    for word in words:
        stripped = word.strip(delims).lower()
        negated = "not_" + stripped if negation else stripped
        result.append(negated)
        if prev:
            bigram = prev + " " + negated
            result.append(bigram)
            if pprev:
                trigram = pprev + " " + bigram
                result.append(trigram)
            pprev = prev
        prev = negated
        # flip the negation state on a negation cue ...
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation
        # ... and reset it at clause delimiters
        if any(c in word for c in delims):
            negation = False
    return result
If we run this program on a sample input text = "I am not happy today, and I am not feeling well", we obtain the following sequences of unigrams, bigrams, and trigrams:
[ 'i',
'am',
'i am',
'not',
'am not',
'i am not',
'not_happy',
'not not_happy',
'am not not_happy',
'not_today',
'not_happy not_today',
'not not_happy not_today',
'and',
'not_today and',
'not_happy not_today and',
'i',
'and i',
'not_today and i',
'am',
'i am',
'and i am',
'not',
'am not',
'i am not',
'not_feeling',
'not not_feeling',
'am not not_feeling',
'not_well',
'not_feeling not_well',
'not not_feeling not_well']
We may subsequently store these n-grams in an array for future retrieval and analysis. Process the not_ words as the negative of the [sentiment, polarity] that you have defined for their counterparts.
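As a minimal sketch of that last step (the polarity lexicon below is hypothetical):

# hypothetical unigram polarity lexicon
polarity = {'happy': 1.0, 'well': 1.0, 'sad': -1.0}

def score(tokens):
    total = 0.0
    for tok in tokens:
        if tok.startswith('not_'):
            # negated word: flip the sign of its counterpart's polarity
            total -= polarity.get(tok[len('not_'):], 0.0)
        else:
            total += polarity.get(tok, 0.0)
    return total

print(score(negate_sequence("I am not happy today")))  # -1.0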
This seems to work decently well as a poor man's word negation in Python. It's definitely not perfect, but it may be useful in some cases. It takes a spaCy sentence object.
def word_is_negated(word):
    """Return True if the token has a direct 'neg' child, or, for verbs,
    if a governing verb has a 'neg' child."""
    for child in word.children:
        if child.dep_ == 'neg':
            return True
    if word.pos_ in {'VERB'}:
        for ancestor in word.ancestors:
            if ancestor.pos_ in {'VERB'}:
                for child2 in ancestor.children:
                    if child2.dep_ == 'neg':
                        return True
    return False

def find_negated_wordSentIdxs_in_sent(sent, idxs_of_interest=None):
    """Return the indices of the negated words in a spaCy sentence."""
    negated_word_idxs = set()
    for word_sent_idx, word in enumerate(sent):
        if idxs_of_interest:
            if word_sent_idx not in idxs_of_interest:
                continue
        if word_is_negated(word):
            negated_word_idxs.add(word_sent_idx)
    return negated_word_idxs
Call it like this:
import spacy
nlp = spacy.load('en_core_web_lg')
find_negated_wordSentIdxs_in_sent(nlp("I have hope, but I do not like summer"))
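With that parse, this should return the index of 'like' (which has a 'not' child with the neg dependency), while 'hope' in the first clause is left alone.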
EDIT:
As @Amandeep pointed out, depending on your use case, you may also want to include NOUNS, ADJECTIVES, ADVERBS in the line: if word.pos_ in {'VERB'}:.
It's been a while since I've worked on sentiment analysis, so I'm not sure what the status of this area is now, and in any case I have never used nltk for it, so I can't point you to anything there. But in general, I think it's safe to say that this is an active area of research and an essential part of NLP, and that it certainly isn't a problem that has been 'solved' yet. It's one of the finer, more interesting fields of NLP, involving irony, sarcasm, and the scope of negations. Often, coming up with a correct analysis means interpreting a lot of context/domain/discourse information, which isn't straightforward at all.
You may want to look at this topic: Can an algorithm detect sarcasm. And some googling will probably give you a lot more information.
In short: your question is way too broad for a specific answer.
Also, I wonder what you mean by "I did everything except handling the negations". Do you mean you identified 'negative' words? Have you considered that this information can be conveyed by a lot more than words like not and no? Consider, for example, "Your solution was not good" vs. "Your solution was suboptimal".
What exactly you are looking for, and what will suffice in your situation, obviously depends on the context and domain of application.
This probably wasn't the answer you were hoping for, but I'd suggest you do a bit more research (as a lot of smart things have been done by smart people in this field).
I have a list
['mPXSz0qd6j0 youtube ', 'lBz5XJRLHQM youtube ', 'search OpHQOO-DwlQ ',
'sachin 47427243 ', 'alex smith ', 'birthday JEaM8Lg9oK4 ',
'nebula 8x41n9thAU8 ', 'chuck norris ',
'searcher O6tUtqPcHDw ', 'graham wXqsg59z7m0 ', 'queries K70QnTfGjoM ']
Is there some way to identify the strings in the list items that aren't correctly spelled words and remove them?
You can use, e.g. PyEnchant for basic dictionary checking and NLTK to take minor spelling issues into account, like this:
import enchant
import nltk

spell_dict = enchant.Dict('en_US')  # or whatever language is supported

def get_distance_limit(w):
    '''
    The word is considered good
    if it's no further from a known word than this limit.
    '''
    return len(w) / 5 + 2  # just for example, allowing around 1 typo per 5 chars

def check_word(word):
    if spell_dict.check(word):
        return True  # a known dictionary word
    # try similar words
    max_dist = get_distance_limit(word)
    for suggestion in spell_dict.suggest(word):
        if nltk.edit_distance(suggestion, word) < max_dist:
            return True
    return False
Add case normalisation and a filter for digits and you'll get a pretty good heuristic.
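Applied to the list from the question, with the case normalisation done via lower(), that could look something like this:

items = ['mPXSz0qd6j0 youtube ', 'alex smith ', 'chuck norris ']

# keep only the items whose every token passes the dictionary check
cleaned = [item for item in items
           if all(check_word(w) for w in item.lower().split())]
print(cleaned)  # tokens like 'mPXSz0qd6j0' should fail the check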
It is entirely possible to compare your list members against a list of words that you believe to be valid for your input.
This can be done in many ways, partly depending on your definition of "properly spelled" and what you end up using as a comparison list. If you decide that numbers, underscores, or mixed case preclude an entry from being valid, you could test for that with regex matching.
After the regex pass, you would have to decide what a valid character to split on should be. Is it spaces (are you willing to break on 'ad hoc', where 'ad' is an abbreviation and 'hoc' is not a word)? Is it hyphens (this will break on hyphenated last names)?
With these above criteria decided, it's just a decision of what word, proper name, and common slang list to use and a list comprehension:
word_list[:] = [term for term in word_list if passes_my_membership_criteria(term)]
where passes_my_membership_criteria() is a function that contains the rules for staying in the list of words, returning False for things that you've decided are not valid.
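As a rough sketch of what such a function might look like, reusing PyEnchant from the answer above (the rules themselves are illustrative):

import re
import enchant

spell_dict = enchant.Dict('en_US')
token_re = re.compile(r'^[A-Za-z]+$')  # reject digits, underscores, etc.

def passes_my_membership_criteria(term):
    # every whitespace-separated token must be letters-only
    # and present in the dictionary
    words = term.split()
    return bool(words) and all(
        token_re.match(w) and spell_dict.check(w.lower())
        for w in words
    )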