split sentence without space in python (nltk?)

split sentence without space in python (nltk?) - python

I have a set of concatenated word and i want to split them into arrays
For example :
split_word("acquirecustomerdata")
=> ['acquire', 'customer', 'data']
I found pyenchant, but it's not available for 64bit windows.
Then i tried to split each string into sub string and then compare them to wordnet to find a equivalent word.
For example :
from nltk import wordnet as wn
def split_word(self, word):
result = list()
while(len(word) > 2):
i = 1
found = True
while(found):
i = i + 1
synsets = wn.synsets(word[:i])
for s in synsets:
if edit_distance(s.name().split('.')[0], word[:i]) == 0:
found = False
break;
result.append(word[:i])
word = word[i:]
print(result)
But this solution is not sure and is too long.
So I'm looking for your help.
Thank you

Check - Word Segmentation Task from Norvig's work.
from __future__ import division
from collections import Counter
import re, nltk
WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)
def pdist(counter):
"Make a probability distribution, given evidence from a Counter."
N = sum(counter.values())
return lambda x: counter[x]/N
P = pdist(COUNTS)
def Pwords(words):
"Probability of words, assuming each word is independent of others."
return product(P(w) for w in words)
def product(nums):
"Multiply the numbers together. (Like `sum`, but with multiplication.)"
result = 1
for x in nums:
result *= x
return result
def splits(text, start=0, L=20):
"Return a list of all (first, rest) pairs; start <= len(first) <= L."
return [(text[:i], text[i:])
for i in range(start, min(len(text), L)+1)]
def segment(text):
"Return a list of words that is the most probable segmentation of text."
if not text:
return []
else:
candidates = ([first] + segment(rest)
for (first, rest) in splits(text, 1))
return max(candidates, key=Pwords)
print segment('acquirecustomerdata')
#['acquire', 'customer', 'data']
For better solution than this you can use bigram/trigram.
More examples at : Word Segmentation Task

There is a library called "wordsegment" that you can use: http://www.grantjenks.com/docs/wordsegment/
pip install wordsegment
import wordsegment
from wordsegment import load, segment
load()
segment("acquirecustomerdata")
Output:
['acquire', 'customer', 'data']

If you have a list of all possible words, you can use something like this:
import re
word_list = ["go", "walk", "run", "jump"] # list of all possible words
pattern = re.compile("|".join("%s" % word for word in word_list))
s = "gowalkrunjump"
result = re.findall(pattern, s)

Related

Python: Finding and counting exact and approximate matches of words in txt file

My program is close to doing what I want it to do, but I have one hangup: many of the keywords I'm trying to find might have symbols in the middle or might be misspelled. I would therefore like to count the words that are misspelled as keyword matches as if they word spelled correctly. For example, let's say my text says: "settlement settl#7*nt se##tl#ment ann&&ity annuity."
I want to count the times the .txt file has the keywords "settlement" and "annuity", but also words that begin with "sett" and end with "nt" as "settlement' and words that begin "ann" and end with "y" as annuity.
I've been able to count exact words and do pretty close to what I want it to do. But now I would like to do the approximate matches. I'm not even sure this is possible. Thanks.
out1 = open("seen.txt", "w")
out2 = open("missing.txt", "w")
def count_words_in_dir(dirpath, words, action=None):
for filepath in glob.iglob(os.path.join("/Settlement", '*.txt')):
with open(filepath) as f:
data = f.read()
for key, val in words.items():
# print("key is " + key + "\n")
ct = data.count(key)
words[key] = ct
if action:
action(filepath, words)
def print_summary(filepath, words):
for key, val in sorted(words.items()):
whichout = out1 if val > 0 else out2
print(filepath, file=whichout)
print('{0}: {1}'.format(key, val), file=whichout)
filepath = sys.argv[1]
keys = ["annuity", "settlement"]
words = dict.fromkeys(keys, 0)
count_words_in_dir(filepath, words, action=print_summary)
out1.close()
out2.close()

For fuzzy matching you can use regex module, install it one time through pip install regex command.
Through this regex module you can use any expression and through {e<=2} suffix you can specify number of errors that can appear in the word to match regular expression (one error is either substitution or insertion or deletion of one symbol). This is also called edit distance or Levenshtein distance.
As an example I wrote my own function for counting words inside a given string. This function has num_errors param that specifies how many errors are alright for given word to match, I specified num_errors = 3, but you can set it to higher error rate, but don't set it to very high otherwise any word in text will match any reference word.
To split sentence into words I used re.split().
Try it online!
import regex as re
def count_words(text, words, *, num_errors = 3):
we = ['(' + re.escape(e) + f'){{e<={num_errors}}}' for e in words]
cnt = {e : 0 for e in words}
for wt in re.split(r'[,.\s]+', text):
for wre, wrt in zip(we, words):
if re.fullmatch(wre, wt):
cnt[wrt] += 1
break
return cnt
text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))
Output:
{'settlement': 3, 'annuity': 2}
As a faster alternative to regex module you can use Levenshtein module, install it once through pip install python-Levenshtein command.
This module implements only edit-distance (mentioned above) and should work much faster than regex module.
Same code as above but implemented using Levenshtein module is below:
Try it online!
import Levenshtein, re
def count_words(text, words, *, num_errors = 3):
cnt = {e : 0 for e in words}
for wt in re.split(r'[,.\s]+', text):
for wr in words:
if Levenshtein.distance(wr, wt) <= num_errors:
cnt[wr] += 1
break
return cnt
text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))
Output:
{'settlement': 3, 'annuity': 2}
As requested by OP I'm implementing 3rd algorithm that doesn't use any re.split() for splitting into words, but uses re.finditer() instead.
Try it online!
import regex as re
def count_words(text, words, *, num_errors = 3):
we = ['(' + re.escape(e) + f'){{e<={num_errors}}}' for e in words]
cnt = {e : 0 for e in words}
for wre, wrt in zip(we, words):
cnt[wrt] += len(list(re.finditer(wre, text)))
return cnt
text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))
Output:
{'settlement': 3, 'annuity': 2}

Replace multiple equal strings in a word with list of strings from JSON

I'm having trouble with a script to replace the normal letters to especial characters to test a translation system, here's an example (cha-mate is chá-mate but would be tested with chã-mate/chã-máte and other variations), but instead of creating this variations, it's switching all of the same characters to only one espcial letter, here's what it's printing:
chá-máte
chã-mãte
Here's what should print in theory:
cha-máte
cha-mãte
chá-mate
chã-mate
etc.
Here's the code and the json utilized:
def translation_tester(word):
esp_chars = {
'a': 'áã',
}
#words = [word]
for esp_char in esp_chars:
if esp_char in word:
replacement_chars = esp_chars[esp_char]
for i in range(len(replacement_chars)):
print(word.replace(esp_char, replacement_chars[i]))
def main():
words = ['cha-mate']
for word in words:
translation_tester(word)
main()
Anyway, any help is appreciated, thanks in advance!

To handle arbitrary number of replacements, you need to use recursion. This is how I did it.
intword = 'cha-mate'
esp_chars = {'a': 'áã'}
def wpermute(word, i=0):
for idx, c in enumerate(word[i:], i):
if c in esp_chars:
for s in esp_chars[c]:
newword = word[0:idx] + s + word[idx + 1:]
wpermute(newword, idx + 1)
if idx == len(word) -1:
print(word)
wpermute(intword)
which gives the output of 9 different ways the word can be written.
chá-máte
chá-mãte
chá-mate
chã-máte
chã-mãte
chã-mate
cha-máte
cha-mãte
cha-mate

There might be a nicer way to do this, but you can do the following (making sure to include the plain 'a' in the list of replacement chars):
import itertools
import re
def replace_at_indices(word, new_chars, indices):
new_word = word
for i, index in enumerate(indices):
new_word = new_word[:index] + new_chars[i] + new_word[index+1:]
return new_word
def translation_tester(word):
esp_chars = {
'a': 'aáã',
}
for esp_char in esp_chars:
replacement_chars = list(esp_chars[esp_char])
indices = [m.start() for m in re.finditer(esp_char, word)]
product = list(itertools.product(replacement_chars, repeat=len(indices)))
for p in product:
new_word = replace_at_indices(word, p, indices)
print(new_word)
def main():
words = ['cha-mate']
for word in words:
translation_tester(word)
main()
For your example, this should give you:
cha-mate
cha-máte
cha-mãte
chá-mate
chá-máte
chá-mãte
chã-mate
chã-máte
chã-mãte
See also:
Find all occurrences of a substring in Python
generating permutations with repetitions in python
Replacing a character from a certain index

Finding most sequences of specified length

I'm trying to write python code that will take a string and a length, and search through the string to tell me which sub-string of that particular length occurs the most, prioritizing the first if there's a tie.
For example, "cadabra abra" 2 should return ab
I tried:
import sys
def main():
inputstring = str(sys.argv[1])
length = int(sys.argv[2])
Analyze(inputstring, length)
def Analyze(inputstring, length):
count = 0;
runningcount = -1;
sequence = ""
substring = ""
for i in range(0, len(inputstring)):
substring = inputstring[i:i+length]
for j in range(i+length,len(inputstring)):
#print(runningcount)
if inputstring[j:j+2] == substring:
print("runcount++")
runningcount += 1
print(runningcount)
if runningcount > count:
count = runningcount
sequence = substring
print(sequence)
main()
But can't seem to get it to work. I know I'm at least doing something wrong with the counts, but I'm not sure what. This is my first program in Python too, but I think my problem is probably more with the algorithm than the syntax.

Try to use built-in method, they will make your life easier, this way:
>>> s = "cadabra abra"
>>> x = 2
>>> l = [s[i:i+x] for i in range(len(s)-x+1)]
>>> l
['ca', 'ad', 'da', 'ab', 'br', 'ra', 'a ', ' a', 'ab', 'br', 'ra']
>>> max(l, key=lambda m:s.count(m))
'ab'
EDIT:
Much simpler syntax as per Stefan Pochmann comment:
>>> max(l, key=s.count)

import sys
from collections import OrderedDict
def main():
inputstring = sys.argv[1]
length = int(sys.argv[2])
analyze(inputstring, length)
def analyze(inputstring, length):
d = OrderedDict()
for i in range(0, len(inputstring) - length + 1):
substring = inputstring[i:i+length]
if substring in d:
d[substring] += 1
else:
d[substring] = 1
maxlength = max(d.values())
for k,v in d.items():
if v == maxlength:
print(k)
break
main()

Pretty good stab at a solution for a first Python program. As you learn the language, spend some time reading the excellent documentation. It is full of examples and tips.
For example, the standard library includes a Counter class for counting things (obviously) and an OrderedDict class which remebers the ording in which keys are entered. But the documentation includes an example that combines the two to make an OrderedCounter, which can be used to solve you problem like this:
from collections import Counter, OrderedDict
class OrderedCounter(Counter, OrderedDict):
pass
def analyze(s, n):
substrings = (s[i:i+n] for i in range(len(s)-n+1))
counts = OrderedCounter(substrings)
return max(counts.keys(), key=counts.__getitem__)
analyze("cadabra abra", 2)

Python: How to replace N random string occurrences in text?

Say that I have 10 different tokens, "(TOKEN)" in a string. How do I replace 2 of those tokens, chosen at random, with some other string, leaving the other tokens intact?

>>> import random
>>> text = '(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)'
>>> token = '(TOKEN)'
>>> replace = 'foo'
>>> num_replacements = 2
>>> num_tokens = text.count(token) #10 in this case
>>> points = [0] + sorted(random.sample(range(1,num_tokens+1),num_replacements)) + [num_tokens+1]
>>> replace.join(token.join(text.split(token)[i:j]) for i,j in zip(points,points[1:]))
'(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__foo__(TOKEN)__foo__(TOKEN)__(TOKEN)__(TOKEN)'
In function form:
>>> def random_replace(text, token, replace, num_replacements):
num_tokens = text.count(token)
points = [0] + sorted(random.sample(range(1,num_tokens+1),num_replacements)) + [num_tokens+1]
return replace.join(token.join(text.split(token)[i:j]) for i,j in zip(points,points[1:]))
>>> random_replace('....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....','(TOKEN)','FOO',2)
'....FOO....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....FOO....'
Test:
>>> for i in range(0,9):
print random_replace('....(0)....(0)....(0)....(0)....(0)....(0)....(0)....(0)....','(0)','(%d)'%i,i)
....(0)....(0)....(0)....(0)....(0)....(0)....(0)....(0)....
....(0)....(0)....(0)....(0)....(1)....(0)....(0)....(0)....
....(0)....(0)....(0)....(0)....(0)....(2)....(2)....(0)....
....(3)....(0)....(0)....(3)....(0)....(3)....(0)....(0)....
....(4)....(4)....(0)....(0)....(4)....(4)....(0)....(0)....
....(0)....(5)....(5)....(5)....(5)....(0)....(0)....(5)....
....(6)....(6)....(6)....(0)....(6)....(0)....(6)....(6)....
....(7)....(7)....(7)....(7)....(7)....(7)....(0)....(7)....
....(8)....(8)....(8)....(8)....(8)....(8)....(8)....(8)....

If you need exactly two, then:
Detect the tokens (keep some links to them, like index into the string)
Choose two at random (random.choice)
Replace them

What are you trying to do, exactly? A good answer will depend on that...
That said, a brute-force solution that comes to mind is to:
Store the 10 tokens in an array, such that tokens[0] is the first token, tokens[1] is the second, ... and so on
Create a dictionary to associate each unique "(TOKEN)" with two numbers: start_idx, end_idx
Write a little parser that walks through your string and looks for each of the 10 tokens. Whenever one is found, record the start/end indexes (as start_idx, end_idx) in the string where that token occurs.
Once done parsing, generate a random number in the range [0,9]. Lets call this R
Now, your random "(TOKEN)" is tokens[R];
Use the dictionary in step (3) to find the start_idx, end_idx values in the string; replace the text there with "some other string"

My solution in code:
import random
s = "(TOKEN)test(TOKEN)fgsfds(TOKEN)qwerty(TOKEN)42(TOKEN)(TOKEN)ttt"
replace_from = "(TOKEN)"
replace_to = "[REPLACED]"
amount_to_replace = 2
def random_replace(s, replace_from, replace_to, amount_to_replace):
parts = s.split(replace_from)
indices = random.sample(xrange(len(parts) - 1), amount_to_replace)
replaced_s_parts = list()
for i in xrange(len(parts)):
replaced_s_parts.append(parts[i])
if i < len(parts) - 1:
if i in indices:
replaced_s_parts.append(replace_to)
else:
replaced_s_parts.append(replace_from)
return "".join(replaced_s_parts)
#TEST
for i in xrange(5):
print random_replace(s, replace_from, replace_to, 2)
Explanation:
Splits string into several parts using replace_from
Chooses indexes of tokens to replace using random.sample. This returned list contains unique numbers
Build a list for string reconstruction, replacing tokens with generated index by replace_to.
Concatenate all list elements into single string

Try this solution:
import random
def replace_random(tokens, eqv, n):
random_tokens = eqv.keys()
random.shuffle(random_tokens)
for i in xrange(n):
t = random_tokens[i]
tokens = tokens.replace(t, eqv[t])
return tokens
Assuming that a string with tokens exists, and a suitable equivalence table can be constructed with a replacement for each token:
tokens = '(TOKEN1) (TOKEN2) (TOKEN3) (TOKEN4) (TOKEN5) (TOKEN6) (TOKEN7) (TOKEN8) (TOKEN9) (TOKEN10)'
equivalences = {
'(TOKEN1)' : 'REPLACEMENT1',
'(TOKEN2)' : 'REPLACEMENT2',
'(TOKEN3)' : 'REPLACEMENT3',
'(TOKEN4)' : 'REPLACEMENT4',
'(TOKEN5)' : 'REPLACEMENT5',
'(TOKEN6)' : 'REPLACEMENT6',
'(TOKEN7)' : 'REPLACEMENT7',
'(TOKEN8)' : 'REPLACEMENT8',
'(TOKEN9)' : 'REPLACEMENT9',
'(TOKEN10)' : 'REPLACEMENT10'
}
You can call it like this:
replace_random(tokens, equivalences, 2)
> '(TOKEN1) REPLACEMENT2 (TOKEN3) (TOKEN4) (TOKEN5) (TOKEN6) (TOKEN7) (TOKEN8) REPLACEMENT9 (TOKEN10)'

There are lots of ways to do this. My approach would be to write a function that takes the original string, the token string, and a function that returns the replacement text for an occurrence of the token in the original:
def strByReplacingTokensUsingFunction(original, token, function):
outputComponents = []
matchNumber = 0
unexaminedOffset = 0
while True:
matchOffset = original.find(token, unexaminedOffset)
if matchOffset < 0:
matchOffset = len(original)
outputComponents.append(original[unexaminedOffset:matchOffset])
if matchOffset == len(original):
break
unexaminedOffset = matchOffset + len(token)
replacement = function(original=original, offset=matchOffset, matchNumber=matchNumber, token=token)
outputComponents.append(replacement)
matchNumber += 1
return ''.join(outputComponents)
(You could certainly change this to use shorter identifiers. My style is somewhat more verbose than typical Python style.)
Given that function, it's easy to replace two random occurrences out of ten. Here's some sample input:
sampleInput = 'a(TOKEN)b(TOKEN)c(TOKEN)d(TOKEN)e(TOKEN)f(TOKEN)g(TOKEN)h(TOKEN)i(TOKEN)j(TOKEN)k'
The random module has a handy method for picking random items from a population (not picking the same item twice):
import random
replacementIndexes = random.sample(range(10), 2)
Then we can use the function above to replace the randomly-chosen occurrences:
sampleOutput = strByReplacingTokensUsingFunction(sampleInput, '(TOKEN)',
(lambda matchNumber, token, **keywords:
'REPLACEMENT' if (matchNumber in replacementIndexes) else token))
print sampleOutput
And here's some test output:
a(TOKEN)b(TOKEN)cREPLACEMENTd(TOKEN)e(TOKEN)fREPLACEMENTg(TOKEN)h(TOKEN)i(TOKEN)j(TOKEN)k
Here's another run:
a(TOKEN)bREPLACEMENTc(TOKEN)d(TOKEN)e(TOKEN)f(TOKEN)gREPLACEMENTh(TOKEN)i(TOKEN)j(TOKEN)k

from random import sample
mystr = 'adad(TOKEN)hgfh(TOKEN)hjgjh(TOKEN)kjhk(TOKEN)jkhjk(TOKEN)utuy(TOKEN)tyuu(TOKEN)tyuy(TOKEN)tyuy(TOKEN)tyuy(TOKEN)'
def replace(mystr, substr, n_repl, replacement='XXXXXXX', tokens=10, index=0):
choices = sorted(sample(xrange(tokens),n_repl))
for i in xrange(choices[-1]+1):
index = mystr.index(substr, index) + 1
if i in choices:
mystr = mystr[:index-1] + mystr[index-1:].replace(substr,replacement,1)
return mystr
print replace(mystr,'(TOKEN)',2)

How to find all occurrences of a substring?

Python has string.find() and string.rfind() to get the index of a substring in a string.
I'm wondering whether there is something like string.find_all() which can return all found indexes (not only the first from the beginning or the first from the end).
For example:
string = "test test test test"
print string.find('test') # 0
print string.rfind('test') # 15
#this is the goal
print string.find_all('test') # [0,5,10,15]
For counting the occurrences, see Count number of occurrences of a substring in a string.

There is no simple built-in string function that does what you're looking for, but you could use the more powerful regular expressions:
import re
[m.start() for m in re.finditer('test', 'test test test test')]
#[0, 5, 10, 15]
If you want to find overlapping matches, lookahead will do that:
[m.start() for m in re.finditer('(?=tt)', 'ttt')]
#[0, 1]
If you want a reverse find-all without overlaps, you can combine positive and negative lookahead into an expression like this:
search = 'tt'
[m.start() for m in re.finditer('(?=%s)(?!.{1,%d}%s)' % (search, len(search)-1, search), 'ttt')]
#[1]
re.finditer returns a generator, so you could change the [] in the above to () to get a generator instead of a list which will be more efficient if you're only iterating through the results once.

>>> help(str.find)
Help on method_descriptor:
find(...)
S.find(sub [,start [,end]]) -> int
Thus, we can build it ourselves:
def find_all(a_str, sub):
start = 0
while True:
start = a_str.find(sub, start)
if start == -1: return
yield start
start += len(sub) # use start += 1 to find overlapping matches
list(find_all('spam spam spam spam', 'spam')) # [0, 5, 10, 15]
No temporary strings or regexes required.

Here's a (very inefficient) way to get all (i.e. even overlapping) matches:
>>> string = "test test test test"
>>> [i for i in range(len(string)) if string.startswith('test', i)]
[0, 5, 10, 15]

Use re.finditer:
import re
sentence = input("Give me a sentence ")
word = input("What word would you like to find ")
for match in re.finditer(word, sentence):
print (match.start(), match.end())
For word = "this" and sentence = "this is a sentence this this" this will yield the output:
(0, 4)
(19, 23)
(24, 28)

Again, old thread, but here's my solution using a generator and plain str.find.
def findall(p, s):
'''Yields all the positions of
the pattern p in the string s.'''
i = s.find(p)
while i != -1:
yield i
i = s.find(p, i+1)
Example
x = 'banananassantana'
[(i, x[i:i+2]) for i in findall('na', x)]
returns
[(2, 'na'), (4, 'na'), (6, 'na'), (14, 'na')]

You can use re.finditer() for non-overlapping matches.
>>> import re
>>> aString = 'this is a string where the substring "is" is repeated several times'
>>> print [(a.start(), a.end()) for a in list(re.finditer('is', aString))]
[(2, 4), (5, 7), (38, 40), (42, 44)]
but won't work for:
In [1]: aString="ababa"
In [2]: print [(a.start(), a.end()) for a in list(re.finditer('aba', aString))]
Output: [(0, 3)]

Come, let us recurse together.
def locations_of_substring(string, substring):
"""Return a list of locations of a substring."""
substring_length = len(substring)
def recurse(locations_found, start):
location = string.find(substring, start)
if location != -1:
return recurse(locations_found + [location], location+substring_length)
else:
return locations_found
return recurse([], 0)
print(locations_of_substring('this is a test for finding this and this', 'this'))
# prints [0, 27, 36]
No need for regular expressions this way.

If you're just looking for a single character, this would work:
string = "dooobiedoobiedoobie"
match = 'o'
reduce(lambda count, char: count + 1 if char == match else count, string, 0)
# produces 7
Also,
string = "test test test test"
match = "test"
len(string.split(match)) - 1
# produces 4
My hunch is that neither of these (especially #2) is terribly performant.

this is an old thread but i got interested and wanted to share my solution.
def find_all(a_string, sub):
result = []
k = 0
while k < len(a_string):
k = a_string.find(sub, k)
if k == -1:
return result
else:
result.append(k)
k += 1 #change to k += len(sub) to not search overlapping results
return result
It should return a list of positions where the substring was found.
Please comment if you see an error or room for improvment.

This does the trick for me using re.finditer
import re
text = 'This is sample text to test if this pythonic '\
'program can serve as an indexing platform for '\
'finding words in a paragraph. It can give '\
'values as to where the word is located with the '\
'different examples as stated'
# find all occurances of the word 'as' in the above text
find_the_word = re.finditer('as', text)
for match in find_the_word:
print('start {}, end {}, search string \'{}\''.
format(match.start(), match.end(), match.group()))

This thread is a little old but this worked for me:
numberString = "onetwothreefourfivesixseveneightninefiveten"
testString = "five"
marker = 0
while marker < len(numberString):
try:
print(numberString.index("five",marker))
marker = numberString.index("five", marker) + 1
except ValueError:
print("String not found")
marker = len(numberString)

You can try :
>>> string = "test test test test"
>>> for index,value in enumerate(string):
if string[index:index+(len("test"))] == "test":
print index
0
5
10
15

You can try :
import re
str1 = "This dress looks good; you have good taste in clothes."
substr = "good"
result = [_.start() for _ in re.finditer(substr, str1)]
# result = [17, 32]

When looking for a large amount of key words in a document, use flashtext
from flashtext import KeywordProcessor
words = ['test', 'exam', 'quiz']
txt = 'this is a test'
kwp = KeywordProcessor()
kwp.add_keywords_from_list(words)
result = kwp.extract_keywords(txt, span_info=True)
Flashtext runs faster than regex on large list of search words.

This function does not look at all positions inside the string, it does not waste compute resources. My try:
def findAll(string,word):
all_positions=[]
next_pos=-1
while True:
next_pos=string.find(word,next_pos+1)
if(next_pos<0):
break
all_positions.append(next_pos)
return all_positions
to use it call it like this:
result=findAll('this word is a big word man how many words are there?','word')

src = input() # we will find substring in this string
sub = input() # substring
res = []
pos = src.find(sub)
while pos != -1:
res.append(pos)
pos = src.find(sub, pos + 1)

Whatever the solutions provided by others are completely based on the available method find() or any available methods.
What is the core basic algorithm to find all the occurrences of a
substring in a string?
def find_all(string,substring):
"""
Function: Returning all the index of substring in a string
Arguments: String and the search string
Return:Returning a list
"""
length = len(substring)
c=0
indexes = []
while c < len(string):
if string[c:c+length] == substring:
indexes.append(c)
c=c+1
return indexes
You can also inherit str class to new class and can use this function
below.
class newstr(str):
def find_all(string,substring):
"""
Function: Returning all the index of substring in a string
Arguments: String and the search string
Return:Returning a list
"""
length = len(substring)
c=0
indexes = []
while c < len(string):
if string[c:c+length] == substring:
indexes.append(c)
c=c+1
return indexes
Calling the method
newstr.find_all('Do you find this answer helpful? then upvote
this!','this')

This is solution of a similar question from hackerrank. I hope this could help you.
import re
a = input()
b = input()
if b not in a:
print((-1,-1))
else:
#create two list as
start_indc = [m.start() for m in re.finditer('(?=' + b + ')', a)]
for i in range(len(start_indc)):
print((start_indc[i], start_indc[i]+len(b)-1))
Output:
aaadaa
aa
(0, 1)
(1, 2)
(4, 5)

Here's a solution that I came up with, using assignment expression (new feature since Python 3.8):
string = "test test test test"
phrase = "test"
start = -1
result = [(start := string.find(phrase, start + 1)) for _ in range(string.count(phrase))]
Output:
[0, 5, 10, 15]

I think the most clean way of solution is without libraries and yields:
def find_all_occurrences(string, sub):
index_of_occurrences = []
current_index = 0
while True:
current_index = string.find(sub, current_index)
if current_index == -1:
return index_of_occurrences
else:
index_of_occurrences.append(current_index)
current_index += len(sub)
find_all_occurrences(string, substr)
Note: find() method returns -1 when it can't find anything

The pythonic way would be:
mystring = 'Hello World, this should work!'
find_all = lambda c,s: [x for x in range(c.find(s), len(c)) if c[x] == s]
# s represents the search string
# c represents the character string
find_all(mystring,'o') # will return all positions of 'o'
[4, 7, 20, 26]
>>>

if you only want to use numpy here is a solution
import numpy as np
S= "test test test test"
S2 = 'test'
inds = np.cumsum([len(k)+len(S2) for k in S.split(S2)[:-1]])- len(S2)
print(inds)

if you want to use without re(regex) then:
find_all = lambda _str,_w : [ i for i in range(len(_str)) if _str.startswith(_w,i) ]
string = "test test test test"
print( find_all(string, 'test') ) # >>> [0, 5, 10, 15]

please look at below code
#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''
def get_substring_indices(text, s):
result = [i for i in range(len(text)) if text.startswith(s, i)]
return result
if __name__ == '__main__':
text = "How much wood would a wood chuck chuck if a wood chuck could chuck wood?"
s = 'wood'
print get_substring_indices(text, s)

def find_index(string, let):
enumerated = [place for place, letter in enumerate(string) if letter == let]
return enumerated
for example :
find_index("hey doode find d", "d")
returns:
[4, 7, 13, 15]

Not exactly what OP asked but you could also use the split function to get a list of where all the substrings don't occur. OP didn't specify the end goal of the code but if your goal is to remove the substrings anyways then this could be a simple one-liner. There are probably more efficient ways to do this with larger strings; regular expressions would be preferable in that case
# Extract all non-substrings
s = "an-example-string"
s_no_dash = s.split('-')
# >>> s_no_dash
# ['an', 'example', 'string']
# Or extract and join them into a sentence
s_no_dash2 = ' '.join(s.split('-'))
# >>> s_no_dash2
# 'an example string'
Did a brief skim of other answers so apologies if this is already up there.

def count_substring(string, sub_string):
c=0
for i in range(0,len(string)-2):
if string[i:i+len(sub_string)] == sub_string:
c+=1
return c
if __name__ == '__main__':
string = input().strip()
sub_string = input().strip()
count = count_substring(string, sub_string)
print(count)

I runned in the same problem and did this:
hw = 'Hello oh World!'
list_hw = list(hw)
o_in_hw = []
while True:
o = hw.find('o')
if o != -1:
o_in_hw.append(o)
list_hw[o] = ' '
hw = ''.join(list_hw)
else:
print(o_in_hw)
break
Im pretty new at coding so you can probably simplify it (and if planned to used continuously of course make it a function).
All and all it works as intended for what i was doing.
Edit: Please consider this is for single characters only, and it will change your variable, so you have to create a copy of the string in a new variable to save it, i didnt put it in the code cause its easy and its only to show how i made it work.

By slicing we find all the combinations possible and append them in a list and find the number of times it occurs using count function
s=input()
n=len(s)
l=[]
f=input()
print(s[0])
for i in range(0,n):
for j in range(1,n+1):
l.append(s[i:j])
if f in l:
print(l.count(f))

To find all the occurence of a character in a give string and return as a dictionary
eg: hello
result :
{'h':1, 'e':1, 'l':2, 'o':1}
def count(string):
result = {}
if(string):
for i in string:
result[i] = string.count(i)
return result
return {}
or else you do like this
from collections import Counter
def count(string):
return Counter(string)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

split sentence without space in python (nltk?) - python

There is a library called "wordsegment" that you can use: http://www.grantjenks.com/docs/wordsegment/ pip install wordsegment import wordsegment from wordsegment import load, segment load() segment("acquirecustomerdata") Output: ['acquire', 'customer', 'data']

If you have a list of all possible words, you can use something like this: import re word_list = ["go", "walk", "run", "jump"] # list of all possible words pattern = re.compile("|".join("%s" % word for word in word_list)) s = "gowalkrunjump" result = re.findall(pattern, s)

Related

Python: Finding and counting exact and approximate matches of words in txt file

Replace multiple equal strings in a word with list of strings from JSON

Finding most sequences of specified length

Python: How to replace N random string occurrences in text?

How to find all occurrences of a substring?

Categories

Resources