How to save a regular expression user input value (Python) - python

I am making a simple chat bot in Python. It has a text file with regular expressions which help to generate the output. The user input and the bot output are separated by a | character.
my name is (?P<name>\w*) | Hi {name}!
This works fine for single sets of input and output responses, however I would like the bot to be able to store the regex values the user inputs and then use them again (i.e. give the bot a 'memory'). For example, I would like to have the bot store the value input for 'name', so that I can have this in the rules:
my name is (?P<word>\w*) | You said your name is {name} already!
my name is (?P<name>\w*) | Hi {name}!
Having no value for 'name' yet, the bot will first output 'Hi steve', and once the bot does have this value, the 'word' rule will apply. I'm not sure if this is easily feasible given the way I have structured my program. The text file is read into a dictionary, with the key and value separated by the | character. When the user inputs some text, the program checks whether the input matches one of the patterns stored in the dictionary and prints out the corresponding bot response (there is also an 'else' case if no match is found).
I need something to happen at the comparing stage so that the text captured by the user's regular expression is saved and then substituted back into the dictionary somehow. All of my regular expressions have different group names (there are no two instances of 'word', for example: there is 'word', 'word2', etc.). I did this because I thought it would make this part of the process easier. I may have structured the whole thing wrongly for this task, though.
Edit: code
import re
io = {}
with open("rules.txt") as brain:
    for line in brain:
        key, value = line.split('|')
        io[key] = value
string = str(raw_input('> ')).lower()+' word'
x = 1
while x == 1:
    for regex, output in io.items():
        match = re.match(regex, string)
        if match:
            print(output.format(**match.groupdict()))
            string = str(raw_input('> ')).lower()+' word'
        else:
            print ' Sorry?'
            string = str(raw_input('> ')).lower()+' word'

I had some difficulty understanding the principle of your algorithm because I'm not used to working with named groups.
The following code is the way I would solve your problem; I hope it will give you some ideas.
I think that having only one dictionary isn't a good principle: it increases the complexity of the reasoning and of the algorithm. So I based the code on two dictionaries: direg and memory.
These two dictionaries have keys that are group indexes: not all the indexes, only the index of the last group in each individual pattern.
That is because, for fun, I decided that the regexes must be able to have several groups.
What I call individual patterns in my code are the following strings:
"[mM]y name [Ii][sS] (\w*)"
"[Ii]n repertory (\w*) I [wW][aA][nN][tT] file (\w*)"
"[Ii] [wW][aA][nN][tT] to ([ \w]*)"
You see that the second individual pattern has 2 capturing groups: consequently there are 3 individual patterns, but a total of 4 groups across all the individual patterns.
So the creation of the dictionaries needs some additional care to take account of the fact that the index of the last matching group (which I get from the lastindex attribute of a regex MatchObject) may not correspond to the numbering of the individual regexes present in the combined pattern: it's harder to explain than to understand. That's the reason why the function distr() counts the occurrences of the strings {0} {1} {2} {3} {4} etc., whose number MUST be the same as the number of groups defined in the corresponding individual pattern.
I found Laurence D'Oliveiro's suggestion to use '||' instead of '|' as the separator interesting.
My code simulates a session in which several inputs are entered:
import re

# Each entry: pattern || reply on first match || reply on later matches
regi = ("[mM]y name [Ii][sS] (\w*)"
        "||Hi {0}!"
        "||You said that your name was {0} !!!",
        "[Ii]n repertory (\w*) I [wW][aA][nN][tT] file (\w*)"
        "||OK here's your file {0}\\{1} :"
        "||I already gave you the file {0}\\{1} !",
        "[Ii] [wW][aA][nN][tT] to ([ \w]*)"
        "||OK, I will do {0}"
        "||You already did {0}. Do yo really want again ?")

direg = {}
memory = {}

def distr(regi,cnt = 0,di = direg,mem = memory,
          regnb = re.compile('{\d+}')):
    for i,el in enumerate(regi,start=1):
        sp = el.split('||')
        # cnt becomes the index of the last group of this individual pattern
        cnt += len(regnb.findall(sp[1]))
        di[cnt] = sp[1]
        mem[cnt] = sp[2]
        yield sp[0]

regx = re.compile('|'.join(distr(regi)))
print 'direg :\n',direg
print
print 'memory :\n',memory

for inp in ('I say that my name is Armano the 1st',
            'In repertory ONE I want file SPACE',
            'I want to record music',
            'In repertory ONE I want file SPACE',
            'I say that my name is Armstrong',
            'But my name IS Armstrong now !!!',
            'In repertory TWO I want file EARTH',
            'Now my name is Helena'):
    print '\ninput ==',inp
    mat = regx.search(inp)
    if direg[mat.lastindex]:
        # First time this pattern matches: answer, then freeze the values in memory
        print 'output ==',direg[mat.lastindex]\
              .format(*(d for d in mat.groups() if d))
        direg[mat.lastindex] = None
        memory[mat.lastindex] = memory[mat.lastindex]\
                                .format(*(d for d in mat.groups() if d))
    else:
        # Pattern already seen: answer from memory
        print 'output ==',memory[mat.lastindex]\
              .format(*(d for d in mat.groups() if d))
        if not memory[mat.lastindex].startswith('Sorry'):
            memory[mat.lastindex] = 'Sorry, ' \
                                    + memory[mat.lastindex][0].lower()\
                                    + memory[mat.lastindex][1:]
Result:
direg :
{1: 'Hi {0}!', 3: "OK here's your file {0}\\{1} :", 4: 'OK, I will do {0}'}
memory :
{1: 'You said that your name was {0} !!!', 3: 'I already gave you the file {0}\\{1} !', 4: 'You already did {0}. Do yo really want again ?'}
input == I say that my name is Armano the 1st
output == Hi Armano!
input == In repertory ONE I want file SPACE
output == OK here's your file ONE\SPACE :
input == I want to record music
output == OK, I will do record music
input == In repertory ONE I want file SPACE
output == I already gave you the file ONE\SPACE !
input == I say that my name is Armstrong
output == You said that your name was Armano !!!
input == But my name IS Armstrong now !!!
output == Sorry, you said that your name was Armano !!!
input == In repertory TWO I want file EARTH
output == Sorry, i already gave you the file ONE\SPACE !
input == Now my name is Helena
output == Sorry, you said that your name was Armano !!!

OK, let me see if I understand this:
You want a dictionary of key-value pairs. This will be the “memory” of the chatbot.
You want to apply regular-expression rules to user input. But which rules might apply is conditional on which keys are already present in the memory dictionary: if “name” is not yet defined, then the rule that defines “name” applies; but if it is, then the rule that mentions “word” applies.
Seems to me you need more information attached to your rules. For example, the “word” rule you gave above shouldn’t actually add “word” to the dictionary, otherwise it would only apply once (imagine if the user keeps trying to say “my name is x” more than twice).
Does that give you a bit more idea about how to proceed?
Oh, by the way, I think “|” is a poor choice for a separator character, because it can occur in regular expressions. Not sure what to suggest: how about “||”?
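To make the idea of "more information attached to your rules" concrete, here is a minimal sketch in Python 3. The three-part rule format and the names RULES and respond are my own assumptions for illustration; they are not part of the original program. Each rule carries one reply for a value seen for the first time and another reply for when the captured name is already in memory.

```python
import re

# Hypothetical rule format (an assumption, not the asker's file format):
# (pattern, reply when the captured names are new, reply when already remembered)
RULES = [
    (r"my name is (?P<name>\w+)",
     "Hi {name}!",
     "You said your name is {name} already!"),
]

def respond(text, memory, rules=RULES):
    """Return the bot's reply to text, storing new captures in memory."""
    for pattern, first_reply, repeat_reply in rules:
        match = re.match(pattern, text.lower())
        if not match:
            continue
        captured = match.groupdict()
        if any(key in memory for key in captured):
            # Value already known: answer from memory and ignore the new input.
            return repeat_reply.format(**memory)
        memory.update(captured)
        return first_reply.format(**memory)
    return "Sorry?"

memory = {}
print(respond("My name is Steve", memory))  # Hi steve!
print(respond("my name is Bob", memory))    # You said your name is steve already!
```

The memory dictionary lives outside the rules, so the rule table itself never has to be rewritten while the bot is running.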

Related

new to python (as in 1 week in) and need help getting pointed in the right direction

I need to write code for a school lab.
Input is First name Middle name Last Name
Output needs to be Last name, First initial. Middle Initial.
It must also work with just first and last name.
Examples:
Input: Jane Ann Doe
Output: Doe, J. A.
Input: Jane Doe
Output: Doe, J.
Code thus far is:
# 2.12 Lab, input First name Middle name last name
# result to print Last name, first initial. Middle initial period.
# result must account for user not having middle name
name = input()
tokens = name.split()
I do not understand how to write an if statement followed by print statement to get the desired output.
name = input("Enter name: ")
tokens = name.split()
if len(tokens) > 2:
    # First, middle and last name given
    print(tokens[-1] + ",", tokens[0][0]+".", tokens[1][0]+".")
else:
    # Only first and last name given
    print(tokens[-1] + ",", tokens[0][0]+".")
With what you have so far, tokens will be a list of the words you entered, such as ['Jane', 'Ann', 'Doe'].
What you need to do is to print out the last of those items in full, followed by a comma. Then each of the other items in order but with just the first letter followed by a period.
You can get the last item of a list x with x[-1]. You can get each of the others with a loop like:
for item in x[:-1]:
    doSomethingWith(item)
And the first character of the string item can be extracted with item[0].
That should hopefully be enough to get you on your way.
If it's not enough, read on, though it would be far better for you if you tried to nut it out yourself first.
...
No? Okay then, here we go ...
The following code shows one way you can do this, with hopefully enough comments that you will understand:
import sys

# Get line and turn into list of words.
inputLine = input("Please enter your full name: ")
tokens = inputLine.split()
print(tokens)

# Pre-check to make sure at least two words were entered.
if len(tokens) < 2:
    print("ERROR: Need at least two tokens in the name.")
    sys.exit(0)

# Print last word followed by comma, no newline (using f-strings).
print(f"{tokens[-1]},", end="")

# Process all but the last word.
for namePart in tokens[:-1]:
    # Print first character of word followed by period, no newline.
    print(f" {namePart[0]}.", end="")

# Make sure line is terminated by a newline character.
print()
You could no doubt make that more robust against weird edge cases like a first name of "." but it should be okay for an educational assignment.
But it handles even more complex names such as "River Rocket Blue Dallas Oliver" (yes, I'm serious, that's a real name).
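The same idea can also be packed into a small function that returns the string instead of printing it piece by piece, which makes it easier to check. This is only a sketch; the name format_name is my own choice and not part of the lab.

```python
def format_name(full_name):
    """Return 'Last, F. M.' for a name with two or more words."""
    tokens = full_name.split()
    if len(tokens) < 2:
        raise ValueError("need at least a first and a last name")
    # One initial plus period for every word except the last one.
    initials = " ".join(f"{part[0]}." for part in tokens[:-1])
    return f"{tokens[-1]}, {initials}"

print(format_name("Jane Ann Doe"))  # Doe, J. A.
print(format_name("Jane Doe"))      # Doe, J.
```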
# 2.12 Lab, input First name Middle name last name
# result to print Last name, first initial. Middle initial period.
# result must account for user not having middle name
name = input()
tokens = name.split()
if len(tokens) == 2: # to identify if only two names entered
    last_name = tokens[1]
    first_init = tokens[0][0]
    print(last_name, ',', ' ', first_init, '.', sep='')
if len(tokens) == 3: # to identify if three names entered
    last_name = tokens[2]
    first_init = tokens[0][0]
    middle_init = tokens[1][0]
    print(last_name, ',', ' ', first_init, '.', ' ', middle_init, '.', sep='')
Try this code:
a = input()
name = a.split(" ")
index = len(name)
if index == 3:
    # Spaces added so the output matches "Doe, J. A."
    print(f"{name[-1]}, {name[-3][0]}. {name[-2][0]}.")
else:
    print(f"{name[-1]}, {name[-2][0]}.")
Here is the explanation of the code:
First, using input(), we get the name of the person.
Then we split the name using .split() with " " as the parameter (written in the parentheses).
Next we find the number of elements in the list (.split() returns a list) for the if statement.
Then we print the output in the if statement shown above, using indexing to extract the first letter of each name.

Check if there are numbers around a keyword in a text file

I have a text file 'Filter.txt' which contains a specific keyword, 'D&O insurance'. I would like to check if there are numbers in the sentence which contains that keyword, as well as in the sentences before and after it.
For example, I have a long paragraph like this:
"International insurance programs necessary for companies with global subsidiaries and offices. Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. D&O insurance grants cover on a claims-made basis. How much is enough? What and who is covered – and not covered? "
The target word is "D&O insurance." If I wanted to extract the target sentence (D&O insurance grants cover on a claims-made basis.) as well as the preceding and following sentences (Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. and How much is enough?), what would be a good approach?
This is what I'm trying to do so far. However I don't really know how to apply to find ways to check in the whole sentence and the ones around it.
import re
for line in open('Filter.txt'):
    match = re.search('D&O insurance(\d+)',line)
    if match:
        print match.group(1)
I'm new to programming, so I'm looking for the possible solutions for that purpose.
Thank you for your help!
Okay, I'm going to take a stab at this. Assume string is the entire contents of your .txt file (you may need to clean the '\n's out).
You're going to want to make a list of potential sentence endings, use that list to find the index positions of the sentence endings, and then use THAT list to make a list of the sentences in the file.
string = "International insurance programs necessary for companies with global subsidiaries and offices. Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. D&O insurance grants cover on a claims-made basis. How much is enough? What and who is covered – and not covered?"
endings = ['! ', '? ','. ']

def pos_find(string):
    # Index of the earliest sentence ending; min() raises ValueError if there is none.
    lst = []
    for ending in endings:
        i = string.find(ending)
        if i != -1:
            lst.append(i)
    return min(lst)

def sort_sentences(string):
    sentences = []
    while True:
        try:
            i = pos_find(string)
            sentences.append(string[0:i+1])
            string = string[i+2:]
        except ValueError:
            # No ending left: the remainder is the last sentence.
            sentences.append(string)
            break
    return sentences

sentences = sort_sentences(string)
Once you have the list of sentences (I got a little weary here, so forgive the spaghetti code - the functionality is there), you will need to comb through that list to find characters that could be integers (this is how I'm checking for numbers...but you COULD do it different).
for i in range(len(sentences)):
    sentence = sentences[i]
    match = sentence.find('D&O insurance')
    print(match)
    if match >= 0:
        # Previous sentence, matching sentence, next sentence
        lst = [sentences[i-1], sentence, sentences[i+1]]
        for j in range(len(lst)):
            sen = lst[j]
            for char in sen:
                try:
                    int(char)
                    print(f'Found {char} in "{sen}", index {j}')
                except ValueError:
                    pass
Note that you will have to make some modifications to capture multi-digit numbers. This will just print something for each digit in the full number (i.e. it will print a statement for 1, 0, and 0 if it finds 100 in the file). You will also need to handle the two edge cases where the D&O insurance substring is found in the first or last sentence. In the code above, a match in the last sentence raises an IndexError because there is no following sentence, and a match in the first sentence makes the index i-1 equal to -1, which silently wraps around to the last sentence.
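For what it's worth, here is a rough sketch of how both remaining issues (multi-digit numbers and the first/last sentence edge cases) could be handled with the re module. The function names sentences_around and numbers_in are hypothetical, and the sentence splitting is deliberately naive: it assumes a sentence ends at '.', '!' or '?' followed by whitespace.

```python
import re

def sentences_around(text, keyword, window=1):
    """Return, for each sentence containing keyword, that sentence
    plus up to `window` sentences on either side."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for i, sentence in enumerate(sentences):
        if keyword in sentence:
            start = max(0, i - window)  # never wrap around to the end
            hits.append(sentences[start:i + window + 1])  # slicing tolerates the end
    return hits

def numbers_in(sentences):
    """Collect whole numbers such as 100 or 1,000.50 as single strings."""
    return [num for s in sentences
            for num in re.findall(r"\d+(?:[.,]\d+)*", s)]

text = "A policy costs 100 dollars. D&O insurance grants cover. How much? Later."
for group in sentences_around(text, "D&O insurance"):
    print(group, numbers_in(group))
```

Slicing is used instead of direct indexing so that a match in the first or last sentence simply yields a shorter group.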

How to select sub-strings based on the presence of word pairs? Python

I have a large number of sentences, from which I want to extract sub-sentences that start with certain word combinations. For example, I want to extract sentence segments that begin with "what does" or "what is", etc. (essentially eliminating the words from the sentence that appear before the word-pairs). Both the sentences and the word-pairs are stored in a DataFrame:
   'Sentence'                                   'First2'
0  If this is a string what does it say?        can I
1  And this is a string, should it say more?    should it
2  This is yet another string.                  what does
3  etc. etc.                                    etc. etc
The result I want from the above example would be:
0 what does it say?
1 should it say more?
2
The most obvious solution (at least to me) below does not work. It only uses the first word-pair b to go over all the sentences r, but not the other b's.
a = df['Sentence']
b = df['First2']
#The function seems to loop over all r's but only over the first b:
def func(r):
    for x in b:
        if x in r:
            s = r[r.index(x):]
            return s
        else:
            return ''
df['Segments'] = a.apply(func)
It seems that looping over two DataFrames simultaneously in this way does not work. Is there a more efficient and effective way to do this?
I believe there is a bug in your code.
        else:
            return ''
This means if the 1st comparison is not a match, 'func' will return immediately. That might be why the code does not return any matches.
A sample working code is below:
# Return the part of the sentence starting at the first matching word pair.
def func(sentence, first_twos=b):
    for first_two in first_twos:
        if first_two in sentence:
            s = sentence[sentence.index(first_two):]
            return s
    # Only give up after every word pair has been tried.
    return ''
df['Segments'] = a.apply(func)
And the output:
df:
{
'First2': ['can I', 'should it', 'what does'],
'Segments': ['what does it say? ', 'should it say more?', ''],
'Sentence': ['If this is a string what does it say? ', 'And this is a string, should it say more?', 'This is yet another string. ' ]
}
You can loop over two things easily via zip(iterator, iterator_foo).
My question was answered by the following code:
def func(r):
    for i in b:
        if i in r:
            q = r[r.index(i):]
            return q
    return ''
df['Segments'] = a.apply(func)
The solution was pointed out here by Daming Lu (only the last line is different from his). The problem was in the last two lines of the original code:
        else:
            return ''
This caused the function to return too early. Daming Lu's answer was better than the answer to the possible duplicate question "python for-loop only executes once?", which created other problems, as explained in my response to wii. (So I am not sure mine really is a duplicate.)
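As a side note, the same logic can be tried out without pandas. The sketch below is a slight variation rather than the accepted code: the hypothetical helper first_segment cuts the sentence at the earliest position where any word pair occurs, instead of at the first word pair in list order that happens to match.

```python
def first_segment(sentence, prefixes):
    """Return the sentence from the earliest occurrence of any prefix, or ''."""
    positions = [sentence.find(p) for p in prefixes]
    found = [pos for pos in positions if pos != -1]
    return sentence[min(found):] if found else ""

prefixes = ["can I", "should it", "what does"]
sentences = [
    "If this is a string what does it say?",
    "And this is a string, should it say more?",
    "This is yet another string.",
]
for s in sentences:
    print(repr(first_segment(s, prefixes)))
```

With a DataFrame, something like df['Sentence'].apply(first_segment, prefixes=list(df['First2'])) should give the same column, since Series.apply forwards extra keyword arguments to the function.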

How do I print the next item in a list that starts with a certain letter/symbol?

So I am making a phone troubleshooting system in Python that needs to give a solution after looking at the user's query. In my list, the keywords that will be matched against the user's query come first, followed by the relevant solution for those keywords. The problem is that different issues have different numbers of keywords to choose from. If all of the solutions in my list start with a symbol such as {, how can I write code that prints the next item in the list starting with { ?
storage = ["wet", "water", "toilet", "{Wipe the phone and place it in a bag of rice for 24 hours.}"]
This is an example of the list that I made. The user's query is: "I dropped my phone in the toilet."
The solution to this problem is right after the word 'toilet', starting with a curly bracket. Can you please provide me with code that will make the program print the next value in the list that starts with a curly bracket?
Given a starting point like:
query = "I dropped my phone in the toilet."
storage = ["wet", "water", "toilet", "{Wipe the phone and place it in a bag of rice for 24 hours.}"]
First we would need some kind of flag that indicates when we have found a matching keyword, let's say:
found_word = False
then we just iterate over storage like so:
for word in storage:
however not all the entries in storage are keywords, some are special values that need to be treated differently:
for word in storage:
    if word.startswith("{"):
When we encounter a value like this, if we have found a keyword we want to print out this special value then stop looping:
    if word.startswith("{"):
        if found_word:
            print(word)
            break
otherwise if the keyword is in the query then we just set the flag to True:
    elif word in query:
        found_word = True
so our final code would be:
found_word = False
for word in storage:
    if word.startswith("{"):
        if found_word:
            print(word)
            break
    elif word in query:
        found_word = True
on the other hand, if you used a dict to store your data like:
wet_solve = "Wipe the phone and place it in a bag of rice for 24 hours."
solutions = {"wet":wet_solve, "water":wet_solve, "toilet":wet_solve}
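Then you would just need to check all the words in the query for one in the solutions (stripping punctuation first, otherwise "toilet." would not match "toilet"):
for word in query.split():
    word = word.strip(".,!?").lower()
    if word in solutions:
        print(solutions[word])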
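If you want to keep the flat list as your data format but still get the convenience of the dict, one option is to convert the list once at startup. This is only a sketch under the assumption that every group of keywords is followed by exactly one solution wrapped in curly brackets; the names build_solutions and find_solution are made up for the example.

```python
import string

def build_solutions(storage):
    """Map every keyword to the next item that starts with '{' (braces removed)."""
    solutions, pending = {}, []
    for item in storage:
        if item.startswith("{"):
            for keyword in pending:
                solutions[keyword] = item[1:-1]
            pending = []
        else:
            pending.append(item)
    return solutions

def find_solution(query, solutions):
    """Return the solution for the first known keyword in the query, or None."""
    for raw in query.lower().split():
        word = raw.strip(string.punctuation)  # "toilet." -> "toilet"
        if word in solutions:
            return solutions[word]
    return None

storage = ["wet", "water", "toilet",
           "{Wipe the phone and place it in a bag of rice for 24 hours.}"]
solutions = build_solutions(storage)
print(find_solution("I dropped my phone in the toilet.", solutions))
```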
The following will do exactly what you asked: "Can you please provide me with a code that will make the program print the next value in the list that starts with a curly bracket?"
def findNextCurly(keyword):
    index = storage.index(keyword) + 1
    while not storage[index].startswith("{"):
        index = index+1
    print (storage[index][1:-1])
>>> findNextCurly('test')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in findNextCurly
ValueError: 'test' is not in list
>>> findNextCurly('wet')
Wipe the phone and place it in a bag of rice for 24 hours.
... but nothing more.
The notation string[1:-1] is called string slicing and is explained in the official tutorial. Here it is used to remove the first and last characters from the string – the curly brackets.
The solution given by Rad Lexus will return any value that has a curly brace in it, i.e. it would also print hello{brace. If you only want the ones that start with a curly brace, try:
storage = ["wet", "water", "toilet", "{Wipe the phone and place it in a bag of rice for 24 hours.}"]
for s in storage:
    if s.startswith("{"):
        print(s)

fixing words with spaces using a dictionary look up in python?

I have extracted the list of sentences from a document. I am pre-processing this list of sentences to make it more sensible. I am faced with the following problem
I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "
I would like to correct such sentences using a lookup dictionary to remove the unwanted spaces.
The final output should be "more recently the development, which is a potent "
I would assume that this is a straight forward task in preprocessing text? I need help with some pointers to look for such approaches. Thanks.
Take a look at word or text segmentation. The problem is to find the most probable split of a string into a group of words. Example:
thequickbrownfoxjumpsoverthelazydog
The most probable segmentation should be of course:
the quick brown fox jumps over the lazy dog
Here's an article including prototypical source code for the problem using Google Ngram corpus:
http://jeremykun.com/2012/01/15/word-segmentation/
The key for this algorithm to work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here:
https://gist.github.com/miku/7279824
Example usage:
$ python segmentation.py t hequi ckbrownfoxjum ped
thequickbrownfoxjumped
['the', 'quick', 'brown', 'fox', 'jumped']
Using data, even this can be reordered:
$ python segmentation.py lmaoro fll olwt f pwned
lmaorofllolwtfpwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']
Note that the algorithm is quite slow - it's prototypical.
Another approach using NLTK:
http://web.archive.org/web/20160123234612/http://www.winwaed.com:80/blog/2012/03/13/segmenting-words-and-sentences/
As for your problem, you could just concatenate all the string parts you have to get a single string and then run a segmentation algorithm on it.
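To give a feel for how such a segmentation algorithm works, here is a tiny dynamic-programming sketch. It is not the code from the article or the gist: the frequency table is a made-up toy, and the penalty of 10 per character for unknown words is an arbitrary assumption. A real implementation would load word counts from a corpus.

```python
import math
from functools import lru_cache

def segment(text, freq, max_len=20):
    """Split text into the word sequence with the lowest total cost."""
    total = sum(freq.values())

    def cost(word):
        if word in freq:
            return -math.log(freq[word] / total)  # frequent words are cheap
        return 10.0 * len(word)  # assumed penalty for unknown words

    @lru_cache(maxsize=None)
    def best(i):
        # Cheapest segmentation of text[i:], as (cost, tuple of words).
        if i == len(text):
            return (0.0, ())
        options = []
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            tail_cost, tail = best(j)
            options.append((cost(text[i:j]) + tail_cost, (text[i:j],) + tail))
        return min(options)

    return list(best(0)[1])

toy_freq = {"the": 50, "quick": 5, "brown": 5, "fox": 5}  # toy counts
print(segment("thequickbrownfox", toy_freq))
```

For the broken sentence in the question, you would first remove the spaces (and probably set the punctuation aside) before calling segment.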
Your goal is to improve text, not necessarily to make it perfect; so the approach you outline makes sense in my opinion. I would keep it simple and use a "greedy" approach: Start with the first fragment and stick pieces to it as long as the result is in the dictionary; if the result is not, spit out what you have so far and start over with the next fragment. Yes, occasionally you'll make a mistake with cases like the me thod, so if you'll be using this a lot, you could look for something more sophisticated. However, it's probably good enough.
Mainly what you require is a large dictionary. If you'll be using it a lot, I would encode it as a "prefix tree" (a.k.a. trie), so that you can quickly find out if a fragment is the start of a real word. The nltk provides a Trie implementation.
Since spurious word breaks of this kind are inconsistent, I would also extend my dictionary with words already processed in the current document; you may have seen the complete word earlier, but now it's broken up.
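A minimal sketch of that greedy idea follows. Instead of a real trie it uses a plain set of all word prefixes, which answers the same "is this fragment the start of a real word?" question for a small dictionary; the names build_prefixes and greedy_join and the tiny word list are assumptions for illustration. Punctuation handling is left out.

```python
def build_prefixes(words):
    """All leading substrings of all words: a poor man's trie."""
    return {w[:i] for w in words for i in range(1, len(w) + 1)}

def greedy_join(fragments, words):
    """Glue each fragment onto the current piece while the result
    is still the start of some dictionary word."""
    prefixes = build_prefixes(words)
    result, current = [], ""
    for frag in fragments:
        if current and (current + frag) in prefixes:
            current += frag
        else:
            if current:
                result.append(current)  # spit out what we have so far
            current = frag
    if current:
        result.append(current)
    return " ".join(result)

words = {"more", "recently", "the", "development", "which", "is", "a", "potent"}
broken = "more recen t ly the develop ment wh ich is a po ten t"
print(greedy_join(broken.split(), words))
```

As noted above, a greedy pass like this can glue too much when the joined fragments happen to start another dictionary word.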
--Solution 1:
Let's think of the chunks in your sentence as beads on an abacus, with each bead consisting of a partial string. The beads can be moved left or right to generate the permutations (that is, the different ways of spacing them). The position of each fragment is fixed between two adjacent fragments.
In the current case, the beads would be:
(more)(recen)(t)(ly)(the)(develop)(ment,)(wh)(ich)(is)(a)(po)(ten)(t)
This solves 2 subproblems:
a) A bead is a single unit, so we do not care about permutations within the bead, i.e. permutations of "more" are not possible.
b) The order of the beads is constant; only the spacing between them changes, i.e. "more" will always be before "recen" and so on.
Now, generate all the permutations of these beads, which will give output like:
morerecentlythedevelopment,which is a potent
morerecentlythedevelopment,which is a poten t
morerecentlythedevelop ment, wh ich is a po tent
morerecentlythedevelop ment, wh ich is a po ten t
morerecentlythe development,whichisapotent
Then score these permutations based on how many words from your relevant dictionary they contain; the most correct results can then be filtered out easily.
more recently the development, which is a potent will score higher than morerecentlythedevelop ment, wh ich is a po ten t
Code which does the permutation part of the beads:
import re

def gen_abacus_perms(frags):
    if len(frags) == 0:
        return []
    if len(frags) == 1:
        return [frags[0]]

    # The first two beads, either glued together or separated by a space
    prefix_1 = "{0}{1}".format(frags[0],frags[1])
    prefix_2 = "{0} {1}".format(frags[0],frags[1])
    if len(frags) == 2:
        nres = [prefix_1,prefix_2]
        return nres

    rem_perms = gen_abacus_perms(frags[2:])
    res = ["{0}{1}".format(prefix_1, x ) for x in rem_perms] + ["{0} {1}".format(prefix_1, x ) for x in rem_perms] + \
          ["{0}{1}".format(prefix_2, x ) for x in rem_perms] + ["{0} {1}".format(prefix_2 , x ) for x in rem_perms]
    return res

broken = "more recen t ly the develop ment, wh ich is a po ten t"
frags = re.split(r"\s+",broken)
perms = gen_abacus_perms(frags)
print("\n".join(perms))
demo:http://ideone.com/pt4PSt
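The scoring step is not part of the code above, so here is one possible sketch of it. The scoring rule (plus one for each known word, minus one for each unknown token) and the small valid_words set are my own assumptions; the candidates are generated with itertools.product so the example stands on its own.

```python
from itertools import product

def candidates(frags):
    """Yield every way of joining the fragments with '' or ' ' between them."""
    for seps in product(("", " "), repeat=len(frags) - 1):
        out = frags[0]
        for sep, frag in zip(seps, frags[1:]):
            out += sep + frag
        yield out

def score(candidate, valid_words):
    """Known words count +1, unknown tokens count -1 (an assumed rule)."""
    tokens = [t.strip(",.!?;:") for t in candidate.split()]
    known = sum(1 for t in tokens if t in valid_words)
    return known - (len(tokens) - known)

def best_split(frags, valid_words):
    return max(candidates(frags), key=lambda c: score(c, valid_words))

valid_words = {"more", "recently", "the", "development"}  # toy dictionary
frags = "more recen t ly the develop ment,".split()
print(best_split(frags, valid_words))
```

Keep in mind that the number of candidates doubles with every extra fragment, so this brute-force scoring is only practical for short sentences.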
--Solution#2:
I would suggest an alternative approach that makes use of text-analysis intelligence already developed by people working on similar problems with large corpora of data that depend on dictionaries and grammar, e.g. search engines.
I am not well aware of such public or paid APIs, so my example is based on Google results.
Let's try to use Google:
You can keep submitting your invalid terms to Google, for multiple passes, and keep evaluating the results for some score based on your lookup dictionary.
Here are two relevant outputs from two passes over your text:
This output is used for a second pass:
Which gives you the conversion "more recently the development, which is a potent".
To verify the conversion, you will have to use some similarity algorithm and scoring to filter out invalid / not so good results.
One raw technique could be using a comparison of normalized strings using difflib.
>>> import difflib
>>> import re
>>> input = "more recen t ly the develop ment, wh ich is a po ten t "
>>> output = "more recently the development, which is a potent "
>>> input_norm = re.sub(r'\W+', '', input).lower()
>>> output_norm = re.sub(r'\W+', '', output).lower()
>>> input_norm
'morerecentlythedevelopmentwhichisapotent'
>>> output_norm
'morerecentlythedevelopmentwhichisapotent'
>>> difflib.SequenceMatcher(None,input_norm,output_norm).ratio()
1.0
I would recommend stripping away the spaces and looking for dictionary words to break the text down into. There are a few things you can do to make it more accurate. To get the first word in text with no spaces, try taking the entire string and comparing it against dictionary words from a file (you can download several such files from http://wordlist.sourceforge.net/), longest candidates first, then taking letters off the end of the string you want to segment. If you want it to work on a big string, you can make it start with a candidate that is only as long as the longest dictionary word. This should result in you finding the longest words, making it less likely to do something like classify "asynchronous" as "a synchronous". Here is an example that uses raw_input to take in the text to correct and a dictionary file called dictionary.txt:
# Load the word list once into a set (one word per line in dictionary.txt)
with open("dictionary.txt", 'r') as f:
    dictionary = set(line.strip() for line in f)
words = raw_input("enter text to correct spaces on: ")
words = words.replace(" ", "") #removes all spaces, not only the leading and trailing ones
spaced = [] #this is the list of newly broken up words
while words: #loops until all of the text has been broken into words
    #try each possible word length, starting from the biggest (45 letters at most)
    for length in range(min(45, len(words)), 0, -1):
        word = words[:length]
        if word in dictionary: #this is the longest dictionary word at the start of the text
            break
    else:
        word = words[0] #nothing matched: emit a single character so the loop always advances
    spaced.append(word)
    words = words[len(word):] #takes the word away from the front of the text
print ' '.join(spaced) #prints the output
If you want it to be even more accurate, you could try using a natural language parsing program, there are several available for python free online.
Here's something really basic:
chunks = []
for chunk in my_str.split():
    chunks.append(chunk)
    joined = ''.join(chunks)
    if is_word(joined):
        print joined,
        del chunks[:]
# deal with left overs
if chunks:
    print ''.join(chunks)
I assume you have a set of valid words somewhere that can be used to implement is_word. You also have to make sure it deals with punctuation. Here's one way to do that:
def is_word(wd):
    if not wd:
        return False
    # Strip off trailing punctuation. There might be stuff in front
    # that you want to strip too, such as open parentheses; this is
    # just to give the idea, not a complete solution.
    if wd[-1] in ',.!?;:':
        wd = wd[:-1]
    return wd in valid_words
You can check each fragment against a dictionary of words to find a fit, adding fragments together when a match is not found.
def fix_spaces(sentence, dictionary):
    finished_sentence = []
    word = ""
    for fragment in sentence.split():
        word += fragment  # glue fragments together...
        if word.strip(",.!?;:") in dictionary:
            # ...until they form a dictionary word
            finished_sentence.append(word)
            word = ""
    if word:
        # left-over fragments that never formed a word
        finished_sentence.append(word)
    return " ".join(finished_sentence)

sentence = "more recen t ly the develop ment, wh ich is a po ten t "
print(fix_spaces(sentence, dictionary))
For the variable dictionary, download a txt file of English words, then load it into a set in your program. Note that this accepts the shortest match, so a fragment such as "develop" that is already a word is emitted before "ment," can be attached to it.
My index.py file looks like this:
from wordsegment import load, segment
load()
print(segment('morerecentlythedevelopmentwhichisapotent'))
My index.php file looks like this:
<html>
<head>
<title>py script</title>
</head>
<body>
<h1>Hey There!Python Working Successfully In A PHP Page.</h1>
<?php
$python = `python index.py`;
echo $python;
?>
</body>
</html>
I hope this works for you.
