String splitting issue problem with multiword expressions

String splitting issue problem with multiword expressions - python

I have a series of strings like:
'i would like a blood orange'
I also have a list of strings like:
["blood orange", "loan shark"]
Operating on the string, I want the following list:
["i", "would", "like", "a", "blood orange"]
What is the best way to get the above list? I've been using re throughout my code, but I'm stumped with this issue.

This is a fairly straightforward generator implementation: split the string into words, group together words which form phrases, and yield the results.
(There may be a cleaner way to handle skip, but for some reason I'm drawing a blank.)
def split_with_phrases(sentence, phrase_list):
words = sentence.split(" ")
phrases = set(tuple(s.split(" ")) for s in phrase_list)
print phrases
max_phrase_length = max(len(p) for p in phrases)
# Find a phrase within words starting at the specified index. Return the
# phrase as a tuple, or None if no phrase starts at that index.
def find_phrase(start_idx):
# Iterate backwards, so we'll always find longer phrases before shorter ones.
# Otherwise, if we have a phrase set like "hello world" and "hello world two",
# we'll never match the longer phrase because we'll always match the shorter
# one first.
for phrase_length in xrange(max_phrase_length, 0, -1):
test_word = tuple(words[idx:idx+phrase_length])
if test_word in phrases:
return test_word
return None
skip = 0
for idx in xrange(len(words)):
if skip:
# This word was returned as part of a previous phrase; skip it.
skip -= 1
continue
phrase = find_phrase(idx)
if phrase is not None:
skip = len(phrase)
yield " ".join(phrase)
continue
yield words[idx]
print [s for s in split_with_phrases('i would like a blood orange',
["blood orange", "loan shark"])]

Ah, this is crazy, crude and ugly. But looks like it works. You may wanna clean and optimize it but certain ideas here might work.
list_to_split = ['i would like a blood orange', 'i would like a blood orange ttt blood orange']
input_list = ["blood orange", "loan shark"]
for item in input_list:
for str_lst in list_to_split:
if item in str_lst:
tmp = str_lst.split(item)
lst = []
for itm in tmp:
if itm!= '':
lst.append(itm)
lst.append(item)
print lst
output:
['i would like a ', 'blood orange']
['i would like a ', 'blood orange', ' ttt ', 'blood orange']

One quick and dirty, completely un-optimized approach might be to just replace the compounds in the string with a version including a different separator (preferably one that does not occur anywhere else in your target string or compound words). Then split and replace. A more efficient approach would be to iterate only once through the string, matching the compound words where appropriate - but you may have to watch out for instances where there are nested compounds, etc., depending on your array.
#!/usr/bin/python
import re
my_string = "i would like a blood orange"
compounds = ["blood orange", "loan shark"]
for i in range(0,len(compounds)):
my_string = my_string.replace(compounds[i],compounds[i].replace(" ","&"))
my_segs = re.split(r"\s+",my_string)
for i in range(0,len(my_segs)):
my_segs[i] = my_segs[i].replace("&"," ")
print my_segs
Edit: Glenn Maynard's solution is better.

Related

Remove combination of string in dataset in Python

I have a dataset in Python where I want to remove certain combinations of words of colomnX in a new columnY.
Example of 2 rows of columnX:
what is good: the weather what needs improvwement: the house
what is good: everything what needs improvement: nothing
I want tot delete the following combination of words: "what is good" & "what needs improvement".
In the end the following text should remain in the columnY:
the weather the house
everything nothing
I have the following script:
stoplist={'what is good', 'what needs improvement'}
dataset['columnY']=dataset['columnX'].apply(lambda x: ''.join([item in x.split() if item nog in stoplist]))
But it doesn't work. What am I doing wrong here?

In your case the replacement won't happen as the condition if item not in stoplist (in item in x.split() if item not in stoplist) checks if a single word match any phrase of the stoplist, which is wrong.
Instead combine your stop phrases into a regex pattern (for replacement) as shown below:
df['columnY'] = df.columnX.replace(rf"({'|'.join(f'({i})' for i in stoplist)}): ", "", regex=True)
columnX columnY
0 what is good: the weather what needs improveme... the weather the house
1 what is good: everything what needs improvemen... everything nothing

Maybe you can operate on the columns itself.
df["Y"] = df["X"]
df.Y = df.Y.str.replace("what is good", "")
So you would have to do this for every item in your stop list. But I am not sure how many items you have.
So for example
replacement_map = {"what needs improvement": "", "what is good": ""}
for old, new in replacement_map.items():
df.Y = df.Y.str.replace(old, new)
if you need to specify different translations or
items_to_replace = ["what needs improvement", "what is good"]
for item_to_replace in items_to_replace:
df.Y = df.Y.str.replace(item_to_replace, "")
if the item should always be deleted.
Or you can skip the loop if you express it as a regex:
items_to_replace = ["what needs improvement", "what is good"]
replace_regex = r"|".join(item for item in items_to_replace)
df.Y = df.Y.str.replace(replace_regex , "")
(Credits: #MatBailie & #romanperekhrest)

another way without using a regex and to still use apply would be to use a simple function:
def func(s):
for item in stoplist:
s = s.replace(item, '')
return s
df['columnY']=df['columnY'].apply(func)

python: find the last appearance of a word (of a list of words) in a text

given a list of stop words and a string:
list_stop_words = ['for', 'the', 'with']
mystring = 'this is the car for the girl with the long nice red hair'
I would like to get the text starting from the end up to the first stop word of the list.
expected result 'the long nice red hair'
I tried with several for loops but it is super cumbersome there should be a straight way, probably even a one liner.
my super verbose solution:
list_stop_words = ['for', 'the', 'with']
mystring = 'this is the car for the girl with the long nice red hair'
reversed_sentence =mystring.split()[::-1]
for i,word in enumerate(reversed_sentence):
if word in list_stop_words:
position = i
words = reversed_sentence[0:i+1]
print(' '.join(words[::-1]))
break
for word in mastering[::-1]:
Any suggestion for a better approach?
EDIT AFTER THE ANSWER (SEE BELLOW)

you can try something like this
mystring[max([mystring.rfind(stop_word) for stop_word in list_stop_words]):]
basically you find the last occurence of each word with rfind then you find the last from all the words with max then you slice it out

replace any words in string that match an entry in list with a single tag (python)

I have a list of sentences (~100k sentences total) and a list of "infrequent words" (length ~20k). I would like to run through each sentence and replace any word that matches an entry in "infrequent_words" with the tag "UNK".
(so as a small example, if
infrequent_words = ['dog','cat']
sentence = 'My dog likes to chase after cars'
Then after applying the transformation it should be
sentence = 'My unk likes for chase after cars'
I am having trouble finding an efficient way to do this. This function below (applied to each sentence) works, but it is very slow and I know there must be something better. Any suggestions?
def replace_infrequent_words(text,infrequent_words):
for word in infrequent_words:
text = text.replace(word,'unk')
return text
Thank you!

infrequent_words = {'dog','cat'}
sentence = 'My dog likes to chase after cars'
def replace_infrequent_words(text, infrequent_words):
words = text.split()
for i in range(len(words)):
if words[i] in infrequent_words:
words[i] = 'unk'
return ' '.join(words)
print(replace_infrequent_words(sentence, infrequent_words))
Two things that should improve performance:
Use a set instead of a list for storing infrequent_words.
Use a list to store each word in text so you don't have to scan the entire text string with each replacement.
This doesn't account for grammar and punctuation but this should be a performance improvement from what you posted.

Getting list of string array into separate string arrays in python

This is my code.
SENTENCE = "He sad might have lung cancer. It’s just a rumor."
sent=(sent_tokenize(SENTENCE))
The output is
['He sad might have lung cancer.', 'It’s just a rumor.']
I want to get this array as
['He sad might have lung cancer.']
['It’s just a rumor.']
Is their any way of doing this and if so how?

Since you want to split according to a sentence, you can simply do this:
sentence_list = SENTENCE.split('.')
for sentence in sentence_list:
single_sentence = [sentence + '.']
If you actually want all lists containing a single sentence in the same data structure, you'd have to use a list of lists or a dictionary:
my_sentences = []
sentence_list = SENTENCE.split('.')
for sentence in sentence_list:
my_sentences.append([sentence + '.'])
To shorten this out using a list comprehension:
my_sentences = [[sentence + '.'] for sentence in SENTENCE.split('.')]
with the only culprit being that the SENTENCE splitting part will happen more often so it'll be slower working with a massive amount of sentences.

The solution using re.split() function:
import re
s = "He sad might have lung cancer. It’s just a rumor."
parts = [l if l[-1] == '.' else l + '.' for l in re.split(r'\.\s?(?!$)', s)]
print(parts)
The output:
['He sad might have lung cancer.', 'It’s just a rumor.']
r'\.\s?(?!$)' pattern, defines separator as . except that which is at the end of the text (?!$)
l if l[-1] == '.' else l + '.' - recovering . at the end of each line(as the dilimiter was not captured while splitting)

Find Pattern in Textfile From Several Elements In Several Lists?

I am a beginner, been learning python for a few months as my very first programming language. I am looking to find a pattern from a text file. My first attempt has been using regex, which does work but has a limitation:
import re
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']
noun_list_pattern1 = r'\b\w+\b,\s\b\w+\b,\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\sor\s\b\w+\b|\b\w+\b,\s\b\w+\b\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\saor\s\b\w+\b'
with open('test_sentence.txt', 'r') as input_f:
read_input = input_f.read()
word = re.findall(noun_list_pattern1, read_input)
for w in word:
print w
else:
pass
So at this point you may be asking why are the lists in this code since they are not being used. Well, I have been racking my brains out, trying all sort of for loops and if statements in functions to try and find a why to replicate the regex pattern, but using the lists.
The limitation with regex is that the \b\w+\w\ code which is found a number of times in `noun_list_pattern' actually only finds words - any words - but not specific nouns. This could raise false positives. I want to narrow things down more by using the elements in the list above instead of the regex.
Since I actually have 4 different regex in the regex pattern (it contains 4 |), I will just go with 1 of them here. So I would need to find a pattern such as:
'noun in noun_list' + ', ' + 'noun in noun_list' + ', ' + 'C in CC_list' + ' ' + 'noun in noun_list
Obviously, the above code quoted line is not real python code, but is an experession of my thoughts about the match needed. Where I say noun in noun_list I mean an iteration through the noun_list; C in CC_list is an iteration through the CC_list; , is a literal string match for a comma and whitespace.
Hopefully I have made myself clear!
Here is the content of the test_sentence.txt file that I am using:
I need to buy are bacon, cheese and eggs.
I also need to buy milk, cheese, and bacon.
What's your favorite: milk, cheese or eggs.
What's my favorite: milk, bacon, or eggs.

Break your problem down a little. First, you need a pattern that will match the words from your list, but no other. You can accomplish that with the alternation operator | and the literal words. red|green|blue, for example, will match "red", "green", or "blue", but not "purple". Join the noun list with that character, and add the word boundary metacharacters along with parentheses to group the alternations:
noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
Do the same for your list of conjunctions:
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
The overall match you want to make is "one or more noun_patt match, each optionally followed by a comma, followed by a match for the conj_patt and then one more noun_patt match". Easy enough for a regex:
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)
You don't really want to use re.findall(), but re.search(), since you're only expecting one match per line:
for line in lines:
... print re.search(patt, line).group(0)
...
bacon, cheese and eggs
milk, cheese, and bacon
milk, cheese or eggs
milk, bacon, or eggs
As a note, you're close to, if not rubbing up against, the limits of regular expressions as far as parsing English. Any more complex than this, and you will want to look into actual parsing, perhaps with NLTK.

In actuality, you don't necessarily need regular expressions, as there are a number of ways to do this using just your original lists.
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']
#This assumes that file has been read into a list of newline delimited lines called `rawlines`
for line in rawlines:
matches = [noun for noun in noun_list if noun in line] + [conj for conj in conjunctions if conj in line]
if len(matches) == 4:
for match in matches:
print match
The reason the match number is 4, is that 4 is the correct number of matches. (Note, that this could also be the case for repeated nouns or conjunctions).
EDIT:
This version prints the lines that are matched and the words matched. Also fixed the possible multiple word match problem:
words_matched = []
matching_lines = []
for l in lst:
matches = [noun for noun in noun_list if noun in l] + [conj for conj in conjunctions if conj in l]
invalid = True
valid_count = 0
for match in matches:
if matches.count(match) == 1:
valid_count += 1
if valid_count == len(matches):
invalid = False
if not invalid:
words_matched.append(matches)
matching_lines.append(l)
for line, matches in zip(matching_lines, words_matched):
print line, matches
However, if this doesn't suit you, you can always build the regex as follows (using the itertools module):
#The number of permutations choices is 3 (as revealed from your examples)
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
matches = [noun for noun in nouns]
matches.append(conj)
#matches[:2] is the sublist containing the first 2 items, -1 is the last element, and matches[2:-1] is the element before the last element (if the number of nouns were more than 3, this would be the elements between the 2nd and last).
regex_string = '\s,\s'.join(matches[:2]) + '\s' + matches[-1] + '\s' + '\s,\s'.join(matches[2:-1])
print regex_string
#... do regex related matching here
The caveat of this method is that it is pure brute-force as it generates all the possible combinations (read permutations) of both lists which can then be tested to see if each line matches. Hence, it is horrendously slow, but in this example that matches the ones given (the non-comma before the conjunction), this will generate exact matches perfectly.
Adapt as required.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

String splitting issue problem with multiword expressions - python

Related

Remove combination of string in dataset in Python

python: find the last appearance of a word (of a list of words) in a text

replace any words in string that match an entry in list with a single tag (python)

Getting list of string array into separate string arrays in python

Find Pattern in Textfile From Several Elements In Several Lists?

Categories

Resources