Python: How to slice a string using a string?

Assuming that the user entered:
"i like eating big apples"
I want to remove "eating" and "apples", together with whatever is in between these two words. The output in this case would be:
"i like"
In another case, if the user entered:
"i like eating apples very much"
Expected output:
"i like very much"
So I want to slice the input starting from "eating" to "apples".
(However, indices cannot be used, as you are unsure how much the user is going to type, but it is guaranteed that "eating" and "apples" will be entered.)
So, is there any way to slice without using indices, where we instead indicate the start and end of the slice with another string?

Slicing a string in Python works like this:
mystr = "i like eating big apples"
print(mystr[10:20])
This takes the characters from index 10 up to (but not including) index 20, so it prints: ing big ap.
Now the question is how to find the index where 'eating' starts and where 'apples' ends.
Use the .index method to find the beginning of a substring within a string.
mystr.index('eating') returns 7, so if you print mystr[7:] (which means from index 7 to the end of the string) you'll get 'eating big apples'.
The second part is a little tricky. If you use mystr.index('apples'), you'll get the index where 'apples' begins (18), so mystr[7:18] will give you 'eating big '.
In fact you need to go a few characters further to include the word 'apples' itself, which is exactly 6 characters, a number returned by len('apples'). So the final result is:
start = mystr.index('eating')
stop = mystr.index('apples') + len('apples')
print(mystr[start:stop])
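That slice is exactly the part you want to cut out; a minimal sketch of removing it from the input (my addition, built only from the pieces above):
mystr = "i like eating big apples"
start = mystr.index('eating')
stop = mystr.index('apples') + len('apples')
# Keep everything before the slice and everything after it,
# then trim the leftover whitespace at the seam.
print((mystr[:start] + mystr[stop:]).strip())  # 'i like'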

You can do the following:
s = "i like eating big apples"
start_ = s.find("eating")
end_ = s.find("apples") + len("apples")
s[start_:end_] # 'eating big apples'
Use find() to locate the starting index of each desired word in the string, then adjust start_/end_ to your needs.
To remove the substring:
s[:start_] + s[end_:] # 'i like ' (note the trailing space)
And for:
s = "i like eating apples very much"
end_ = s.find("apples") + len("apples")
start_ = s.find("eating")
s[:start_] + s[end_:] # 'i like  very much' (two spaces remain where the middle was cut out)
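A minimal sketch wrapping this into a reusable helper (remove_between is my own name; it assumes both words are present, as the question guarantees):
def remove_between(s, start_word, end_word):
    # Remove everything from start_word through end_word, inclusive;
    # re-splitting and re-joining collapses the leftover double space.
    start = s.find(start_word)
    end = s.find(end_word) + len(end_word)
    return " ".join((s[:start] + s[end:]).split())

print(remove_between("i like eating big apples", "eating", "apples"))        # 'i like'
print(remove_between("i like eating apples very much", "eating", "apples"))  # 'i like very much'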

Maybe you can use this:
txt = "Hello, welcome to my world."
x = txt.find("welcome")
print(x)
Which outputs: 7
To find "eating" and "apple"

S = "i like eating big apples"
Index = S.find("eating")
output = S[Index:-1]

Use the find() or rfind() method to locate the substring's index, then use the result in a slice:
s = "i like eating big apples"
substr = s[s.rfind("eating"):s.rfind("apples") + len("apples")]  # 'eating big apples'
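rfind() returns the index of the last occurrence rather than the first, which matters when the marker words can repeat; a small illustration (the sentence is my own example):
s = "eating apples and eating big apples"
print(s.find("eating"))   # 0  -> first occurrence
print(s.rfind("eating"))  # 18 -> last occurrence

# Combining find() for the start and rfind() for the end selects the widest span:
print(s[s.find("eating"):s.rfind("apples") + len("apples")])
# 'eating apples and eating big apples'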

You can use str.partition to split a string into three parts.
In [112]: s = "i like eating apples very much"
In [113]: h, _, t = s.partition('eating')
In [114]: _, _, t = t.partition('apples')
In [115]: h + t
Out[115]: 'i like  very much'
In [116]: s = "i like eating big apples"
In [117]: h, _, t = s.partition('eating')
In [118]: _, _, t = t.partition('apples')
In [119]: h + t
Out[119]: 'i like '
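A sketch of the same idea as a function (cut_between is my own wrapper, not from the answer); note that str.partition always returns a 3-tuple even when the separator is missing, which makes the chaining safe:
def cut_between(s, start_word, end_word):
    # partition returns (before, separator, after); if the separator is
    # missing, 'before' is the whole string and the other parts are empty.
    head, _, tail = s.partition(start_word)
    _, _, tail = tail.partition(end_word)
    return head + tail

print(cut_between("i like eating big apples", "eating", "apples"))  # 'i like '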

Related

Remove combination of string in dataset in Python

I have a dataset in Python where I want to remove certain combinations of words of columnX in a new columnY.
Example of 2 rows of columnX:
what is good: the weather what needs improvement: the house
what is good: everything what needs improvement: nothing
I want to delete the following combinations of words: "what is good" & "what needs improvement".
In the end the following text should remain in the columnY:
the weather the house
everything nothing
I have the following script:
stoplist={'what is good', 'what needs improvement'}
dataset['columnY'] = dataset['columnX'].apply(lambda x: ' '.join([item for item in x.split() if item not in stoplist]))
But it doesn't work. What am I doing wrong here?
In your case the removal won't happen, because the condition if item not in stoplist checks whether a single word matches any phrase of the stoplist, which is wrong: x.split() yields single words, and a single word never equals a multi-word phrase.
Instead combine your stop phrases into a regex pattern (for replacement) as shown below:
df['columnY'] = df.columnX.replace(rf"({'|'.join(f'({i})' for i in stoplist)}): ", "", regex=True)
columnX columnY
0 what is good: the weather what needs improveme... the weather the house
1 what is good: everything what needs improvemen... everything nothing
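One caveat (my addition, not from the answer): if a stop phrase could contain regex metacharacters, escape each phrase with re.escape before building the pattern:
import re

stoplist = {'what is good', 'what needs improvement'}

# re.escape keeps characters like '?', '(' or '+' inside a phrase from
# being interpreted as regex syntax.
pattern = rf"({'|'.join(re.escape(p) for p in stoplist)}): "
df['columnY'] = df.columnX.replace(pattern, "", regex=True)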
Maybe you can operate on the column itself.
df["Y"] = df["X"]
df.Y = df.Y.str.replace("what is good", "")
So you would have to do this for every item in your stop list. But I am not sure how many items you have.
So for example
replacement_map = {"what needs improvement": "", "what is good": ""}
for old, new in replacement_map.items():
    df.Y = df.Y.str.replace(old, new)
if you need to specify different translations or
items_to_replace = ["what needs improvement", "what is good"]
for item_to_replace in items_to_replace:
    df.Y = df.Y.str.replace(item_to_replace, "")
if the item should always be deleted.
Or you can skip the loop if you express it as a regex:
items_to_replace = ["what needs improvement", "what is good"]
replace_regex = "|".join(items_to_replace)
df.Y = df.Y.str.replace(replace_regex, "", regex=True)
(Credits: #MatBailie & #romanperekhrest)
Another way, without using a regex and still using apply, would be a simple function:
def func(s):
    for item in stoplist:
        s = s.replace(item, '')
    return s

df['columnY'] = df['columnX'].apply(func)

How to slice a string input at a certain unknown index

A string is given as an input (e.g. "What is your name?"). The input always contains a question, which I want to extract. The problem I am trying to solve is that the question always comes wrapped in unneeded text.
So the input could be (but not limited to) the following:
1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn" 2- "What is your\nlastname and email?\ndasf?lkjas" 3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"
(Notice that at the third input, the question starts with the word "Given" and end with "yourself?")
The above input examples are generated by the pytesseract OCR library, which scans an image and converts it into text.
I only want to extract the question from the garbage input and nothing else.
I tried to use the string method find('?', 1) to get the index of the last part of the question (assuming for now that the first question mark is always the end of the question and not part of the input that I don't want). But I can't figure out how to get the index of the first letter of the question. I tried to loop in reverse and take the first \n spotted in the input, but the question doesn't always have \n before its first letter.
def extractQuestion(q):
    index_end_q = q.find('?', 1)
    index_first_letter_of_q = 0  # TODO
    question = q[index_first_letter_of_q:index_end_q + 1]  # +1 keeps the '?'
A way to find the question's first word index would be to search for the first word that has an actual meaning (you're interested in English words I suppose). A way to do that would be using pyenchant:
#!/usr/bin/env python
import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
    return True if GLOSSARY.check(word) else False

sentences = [
    "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
    "What is your\nlastname and email?\ndasf?lkjas",
    "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

for sentence in sentences:
    for i, w in enumerate(sentence.split()):
        if isWord(w):
            print('index: {} => {}'.format(i, w))
            break
The above piece of code gives the following result:
index: 3 => What
index: 0 => What
index: 0 => Given
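To turn the found word into the actual question slice, one could combine it with the position of the first question mark, along the lines of this sketch (extract_question is my own helper, relying on the question's assumption that the first '?' ends the question):
def extract_question(text):
    for w in text.split():
        if GLOSSARY.check(w):
            start = text.index(w)        # first dictionary word: assume the question starts here
            end = text.find('?', start)  # first '?' after it ends the question
            if end != -1:
                return text[start:end + 1]
    return ''

print(extract_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"))
# 'What is your name?'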
You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:
The start of a word \b with an upper case letter [A-Z] followed by a lower case letter [a-z],
then a sequence of non-questionmark-characters [^?]+,
followed by a literal question mark \?.
This can still have some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for you examples it works quite well.
>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
"What is your\nlastname and email?\ndasf?lkjas",
"\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
If that's one blob of text, you can use findall instead of search:
>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
Actually, this also seems to work reasonably well for questions with names in them:
>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?"
>>> re.search(p, t).group()
'How did you like St. Petersburg?'

Drop all strings that are a subset of another string in the same list

I'm working on a scraping project, and for some reason on some paragraphs I get both the complete paragraph and the same paragraph divided into segments. So, if the paragraph is "My house is green. I like it.", I sometimes get:
["My house is green. I like it.", "My house is green.", "I like it."]
So, when I turn everything into text I will get that paragraph duplicated. Is there any way I can check which strings are a subset of other strings in a list?
My desired output in this case would be to be left only with ["My house is green. I like it."]
An efficient approach is to iterate through the list sorted by phrase length in reverse order, adding every possible sub-phrase to a set, so that the set can be used to efficiently check whether the current phrase is a sub-phrase of a previous, longer phrase:
output = []
seen = set()
for phrase in sorted(l, key=len, reverse=True):
    words = tuple(phrase.split())
    if words not in seen:
        output.append(phrase)
        seen.update({words[i: i + n + 1] for n in range(len(words)) for i in range(len(words) - n)})
so that given:
l = ["My house is green. I like it.", "My house is green.", "I like it."]
output becomes:
['My house is green. I like it.']
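For small lists, a simpler quadratic alternative (my own sketch, using plain substring checks rather than the word-level matching above) is to keep only the strings not contained in any other element:
l = ["My house is green. I like it.", "My house is green.", "I like it."]

# Keep s only if it is not a proper substring of some other element.
output = [s for s in l if not any(s != other and s in other for other in l)]
print(output)  # ['My house is green. I like it.']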
I would take the longest string out of the list like this:
arr = ["My house is green. I like it.", "My house is green.", "I like it."]
print(max(arr, key=len))
The longest string can't be a substring of the others, by definition.
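Note (my addition): max returns a single string, so this only works when the whole list collapses into one paragraph; with unrelated paragraphs mixed in, a substring filter is still needed:
arr = ["My house is green. I like it.", "My house is green.", "The sky is blue."]
print(max(arr, key=len))  # 'My house is green. I like it.' -- but 'The sky is blue.' is lost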

How to select sub-strings based on the presence of word pairs? Python

I have a large number of sentences, from which I want to extract sub-sentences that start with certain word combinations. For example, I want to extract sentence segments that begin with "what does" or "what is", etc. (essentially eliminating the words that appear before these word pairs). Both the sentences and the word pairs are stored in a DataFrame:
   'Sentence'                                       'First2'
0  If this is a string what does it say?         0  can I
1  And this is a string, should it say more?     1  should it
2  This is yet another string.                   2  what does
3  etc. etc.                                     3  etc. etc
The result I want from the above example would be:
0 what does it say?
1 should it say more?
2
The most obvious solution (at least to me) below does not work: for each sentence it only ever tests the first word pair in b, not the other items of b.
a = df['Sentence']
b = df['First2']

# The function seems to loop over all sentences but only over the first item of b:
def func(z):
    for x in b:
        if x in z:
            s = z[z.index(x):]
            return s
        else:
            return ''

df['Segments'] = a.apply(func)
It seems that looping over two DataFrames simultaneously in this way does not work. Is there a more efficient and effective way to do this?
I believe there is a bug in your code.
else:
    return ''
This means if the 1st comparison is not a match, 'func' will return immediately. That might be why the code does not return any matches.
A sample working code is below:
# Loop over every word pair for each sentence; only return '' after all pairs have been checked:
def func(sentence, first_twos=b):
    for first_two in first_twos:
        if first_two in sentence:
            s = sentence[sentence.index(first_two):]
            return s
    return ''

df['Segments'] = a.apply(func)
And the output:
df:
{
    'First2': ['can I', 'should it', 'what does'],
    'Segments': ['what does it say? ', 'should it say more?', ''],
    'Sentence': ['If this is a string what does it say? ', 'And this is a string, should it say more?', 'This is yet another string. ']
}
You can loop over two things easily via zip(iterable_a, iterable_b).
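For instance (my illustration): note that zip pairs items positionally, pairing each sentence with only its own word pair, which is not quite what this task needs:
for sentence, pair in zip(df['Sentence'], df['First2']):
    print(pair, '->', sentence)  # row 0 pairs 'can I' with the first sentence, etc.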
My question was answered by the following code:
def func(r):
    for i in b:
        if i in r:
            q = r[r.index(i):]
            return q
    return ''

df['Segments'] = a.apply(func)
The solution was pointed out here by Daming Lu (only the last line is different from his). The problem was in the last two lines of the original code:
else:
    return ''
This caused the function to return too early. Daming Lu's answer was better than the answer to the possible duplicate question python for-loop only executes once?, which created other problems, as explained in my response to wii. (So I am not sure mine really is a duplicate.)
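A vectorized alternative (my own sketch, not from the thread) builds one regex from all word pairs and lets pandas do the matching; str.extract and re.escape are standard pandas/re calls:
import re
import pandas as pd

df = pd.DataFrame({
    "Sentence": ["If this is a string what does it say?",
                 "And this is a string, should it say more?",
                 "This is yet another string."],
    "First2": ["can I", "should it", "what does"],
})

# One alternation of all (escaped) word pairs; keep everything from
# the first matching pair to the end of the sentence.
pattern = "|".join(re.escape(p) for p in df["First2"])
df["Segments"] = df["Sentence"].str.extract(f"((?:{pattern}).*)", expand=False).fillna("")
print(df["Segments"].tolist())
# ['what does it say?', 'should it say more?', '']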

String splitting problem with multiword expressions

I have a series of strings like:
'i would like a blood orange'
I also have a list of strings like:
["blood orange", "loan shark"]
Operating on the string, I want the following list:
["i", "would", "like", "a", "blood orange"]
What is the best way to get the above list? I've been using re throughout my code, but I'm stumped with this issue.
This is a fairly straightforward generator implementation: split the string into words, group together words which form phrases, and yield the results.
(There may be a cleaner way to handle skip, but for some reason I'm drawing a blank.)
def split_with_phrases(sentence, phrase_list):
    words = sentence.split(" ")
    phrases = set(tuple(s.split(" ")) for s in phrase_list)
    print(phrases)
    max_phrase_length = max(len(p) for p in phrases)

    # Find a phrase within words starting at the specified index. Return the
    # phrase as a tuple, or None if no phrase starts at that index.
    def find_phrase(start_idx):
        # Iterate backwards, so we'll always find longer phrases before shorter ones.
        # Otherwise, if we have a phrase set like "hello world" and "hello world two",
        # we'll never match the longer phrase because we'll always match the shorter
        # one first.
        for phrase_length in range(max_phrase_length, 0, -1):
            test_word = tuple(words[start_idx:start_idx + phrase_length])
            if test_word in phrases:
                return test_word
        return None

    skip = 0
    for idx in range(len(words)):
        if skip:
            # This word was returned as part of a previous phrase; skip it.
            skip -= 1
            continue
        phrase = find_phrase(idx)
        if phrase is not None:
            # The current word is the first word of the phrase, so only the
            # remaining len(phrase) - 1 words need to be skipped.
            skip = len(phrase) - 1
            yield " ".join(phrase)
            continue
        yield words[idx]

print([s for s in split_with_phrases('i would like a blood orange',
                                     ["blood orange", "loan shark"])])
Ah, this is crazy, crude and ugly. But it looks like it works. You may want to clean it up and optimize it, but certain ideas here might work.
list_to_split = ['i would like a blood orange', 'i would like a blood orange ttt blood orange']
input_list = ["blood orange", "loan shark"]
for item in input_list:
    for str_lst in list_to_split:
        if item in str_lst:
            tmp = str_lst.split(item)
            lst = []
            for itm in tmp:
                if itm != '':
                    lst.append(itm)
                    lst.append(item)
            print(lst)
output:
['i would like a ', 'blood orange']
['i would like a ', 'blood orange', ' ttt ', 'blood orange']
One quick and dirty, completely un-optimized approach might be to just replace the compounds in the string with a version that uses a different separator (preferably one that does not occur anywhere else in your target string or compound words), then split and restore the spaces. A more efficient approach would be to iterate only once through the string, matching the compound words where appropriate, but you may have to watch out for instances where there are nested compounds, etc., depending on your array.
#!/usr/bin/python
import re

my_string = "i would like a blood orange"
compounds = ["blood orange", "loan shark"]

# Join each compound's words with a placeholder character...
for i in range(0, len(compounds)):
    my_string = my_string.replace(compounds[i], compounds[i].replace(" ", "&"))

# ...split on whitespace, then restore the spaces inside the compounds.
my_segs = re.split(r"\s+", my_string)
for i in range(0, len(my_segs)):
    my_segs[i] = my_segs[i].replace("&", " ")

print(my_segs)
Edit: Glenn Maynard's solution is better.
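A middle ground (my own sketch) is re.split with a capturing group: when the pattern contains a group, re.split keeps the matched separators in the result, so the compounds survive as single items:
import re

compounds = ["blood orange", "loan shark"]
pattern = "(" + "|".join(re.escape(c) for c in compounds) + ")"

parts = re.split(pattern, "i would like a blood orange")
# parts == ['i would like a ', 'blood orange', '']

result = []
for part in parts:
    if part in compounds:
        result.append(part)          # keep multiword expressions whole
    else:
        result.extend(part.split())  # split the rest into single words

print(result)  # ['i', 'would', 'like', 'a', 'blood orange']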
