I have a doubt at using Natural Language ToolKit (NLTK). I'm trying to make an app in order to translate a Natural Language Question into it's logic representation, and query to a database.
The result I got after using the simplify() method under nltk.sem.logic package and got the following expression:
exists z2.(owner(fido, z2) & (z0 = z2))
But what I need is to simplify it as follow:
owner(fido, z0)
Is there another method that could reduce the sentence as I want?
In NLTK, simplify() performs beta reduction (according to the book) which is not what you need. What you are asking is only doable with theorem provers when you apply certain tactics. Which in this case, you either need to know what you expect to get at the end or you know what kinds of axioms can be applied to get such result.
The theorem prover in NLTK is Prover9 which provides tools to check entailment relations. Basically, you can only check if there is a proof with a limited number of steps from a list of expressions (premises) to a goal expression. In your case for example, this was the result:
============================== PROOF =================================
% -------- Comments from original proof --------
% Proof 1 at 0.00 (+ 0.00) seconds.
% Length of proof is 8.
% Level of proof is 4.
% Maximum clause weight is 4.
% Given clauses 0.
1 (exists x (owner(fido,x) & y = x)) # label(non_clause). [assumption].
2 owner(fido,x) # label(non_clause) # label(goal). [goal].
3 owner(fido,f1(x)). [clausify(1)].
4 x = f1(x). [clausify(1)].
5 f1(x) = x. [copy(4),flip(a)].
6 -owner(fido,c1). [deny(2)].
7 owner(fido,x). [back_rewrite(3),rewrite([5(2)])].
8 $F. [resolve(7,a,6,a)].
============================== end of proof ==========================
In NLTK python:
from nltk import Prover9
from nltk.sem import Expression
read_expr = Expression.fromstring
p1 = read_expr('exists z2.(owner(fido, z2) & (z0 = z2))')
c = read_expr('owner(fido, z0)')
result = Prover9().prove(c, [p1])
print(result)
# returns True
UPDATE
In case that you insist on using available tools in python and you want to manually check this certain pattern with regular expressions. You can probably do something like this with regular expression (I don't approve but let's try my nasty tactic):
def my_nasty_tactic(exp):
parameter = re.findall(r'exists ([^.]*)\..*', exp)
if len(parameter) == 1:
parameter = parameter[0]
substitution = re.findall(r'&[ ]*\([ ]*([^ ]+)[ ]*=[ ]*'+parameter+r'[ ]*\)', exp)
if len(substitution) == 1:
substitution = substitution[0]
exp_abs = re.sub(r'exists(?= [^.]*\..*)', "\ ", exp)
exp_abs = re.sub(r'&[ ]*\([ ]*' + substitution + '[ ]*=[ ]*'+parameter+r'[ ]*\)', '', exp_abs)
return read_expr('(%s)(%s)' % (exp_abs, substitution)).simplify()
Then you can use it like this:
my_nasty_tactic('exists z2.(owner(fido, z2) & (z0 = z2))')
# <ApplicationExpression owner(fido,z0)>
Related
how this task can be carried out is more optimized.
Various data come to me.
Example data : test/123, test , test/, test/123/
and I need to write the correct data to my database, but first I need to find it. The correct ones will be in the test/123/ format
and then divide them into variables
a = test
b = 123
Tell me how it can be done ?
the data can be of any size
This answers for a data set of one or more letters ("test" in the example above) and one or more digits ("123" in the example above)
import re
data = 'test/123/'
m = re.match(r'(\w+)/(\d+)/', data)
if m:
a = m[1]
b = m[2]
print(f'Found a = {a}, b = {b}')
else:
print('no match')
For more tweaks of the pattern, refer e.g. to python's Regular Expressions HOWTO
Im not a programer so go easy on me please ! I have a system of 4 linear equations and 4 unknowns, which I think I could use python to solve relatively easily. However my equations not of the form " 5x+2y+z-w=0 " instead I have algebraic constants c_i which I dont know the explicit numerical value of, for example " c_1 x + c_2 y + c_3 z+ c_4w=c_5 " would be one my four equations. So does a solver exist which gives answers for x,y,z,w in terms of the c_i ?
Numpy has a function for this exact problem: numpy.linalg.solve
To construct the matrix we first need to digest the string turning it into an array of coefficients and solutions.
Finding Numbers
First we need to write a function that takes a string like "c_1 3" and returns the number 3.0. Depending on the format you want in your input string you can either iterate over all chars in this array and stop when you find a non-digit character, or you can simply split on the space and parse the second string. Here are both solutions:
def find_number(sub_expr):
"""
Finds the number from the format
number*string or numberstring.
Example:
3x -> 3
4*x -> 4
"""
num_str = str()
for char in sub_expr:
if char.isdigit():
num_str += char
else:
break
return float(num_str)
or the simpler solution
def find_number(sub_expr):
"""
Returns the number from the format "string number"
"""
return float(sub_expr.split()[1])
Note: See edits
Get matrices
Now we can use that to split each expression into two parts: The solution and the equation by the "=". The equation is then split into sub_expressions by the "+" This way we would end turn the string "3x+4y = 3" into
sub_expressions = ["3x", "4y"]
solution_string = "3"
Each sub expression then needs to be fed into our find_numbers function. The End result can be appended to the coefficient and solution matrices:
def get_matrices(expressions):
"""
Returns coefficient_matrix and solutions from array of string-expressions.
"""
coefficient_matrix = list()
solutions = list()
last_len = -1
for expression in expressions:
# Note: In this solution all coefficients must be explicitely noted and must always be in the same order.
# Could be solved with dicts but is probably overengineered.
if not "=" in expression:
print(f"Invalid expression {expression}. Missing \"=\"")
return False
try:
c_string, s_string = expression.split("=")
c_strings = c_string.split("+")
solutions.append(float(s_string))
current_len = len(c_strings)
if last_len != -1 and current_len != last_len:
print(f"The expression {expression} has a mismatching number of coefficients")
return False
last_len = current_len
coefficients = list()
for c_string in c_strings:
coefficients.append(find_number(c_string))
coefficient_matrix.append(coefficients)
except Exception as e:
print(f"An unexpected Runtime Error occured at {coefficient}")
print(e)
exit()
return coefficient_matrix, solutions
Now let's write a simple main function to test this code:
# This is not the code you want to copy-paste
# Look further down.
from sys import argv as args
def main():
expressions = args[1:]
matrix, solutions = get_matrices(expressions)
for row in matrix:
print(row)
print("")
print(solutions)
if __name__ == "__main__":
main()
Let's run the program in the console!
user:$ python3 solve.py 2x+3y=4 3x+3y=2
[2.0, 3.0]
[3.0, 3.0]
[4.0, 2.0]
You can see that the program identified all our numbers correctly
AGAIN: use the find_number function appropriate for your format
Put The Pieces Together
These Matrices now just need to be pumped directly into the numpy function:
# This is the main you want
from sys import argv as args
from numpy.linalg import solve as solve_linalg
def main():
expressions = args[1:]
matrix, solutions = get_matrices(expressions)
coefficients = solve_linalg(matrix, solutions)
print(coefficients)
# This bit needs to be at the very bottom of your code to load all functions first.
# You could just paste the main-code here, but this is considered best-practice
if __name__ == '__main__':
main()
Now let's test that:
$ python3 solve.py x*2+y*4+z*0=20 x*1+y*1+z*-1=3 x*2+y*2+z*-3=3
[2. 4. 3.]
As you can see the program now solves the functions for us.
Out of curiosity: Math homework? This feels like math homework.
Edit: Had a typo "c_string" instead of "c_strings" worked out in all tests out of pure and utter luck.
Edit 2: Upon further inspection I would reccomend to split the sub-expressions by a "*":
def find_number(sub_expr):
"""
Returns the number from the format "string number"
"""
return float(sub_expr.split("*")[1])
This results in fairly readable input strings
I'm trying to use Python to call an API and clean a bunch of strings that represent a movie budget.
So far, I have the following 6 variants of data that come up.
"$1.2 million"
"$1,433,333"
"US$ 2 million"
"US$1,644,736 (est.)
"$6-7 million"
"£3 million"
So far, I've only gotten 1 and 2 parsed without a problem with the following code below. What is the best way to handle all of the other cases or a general case that may not be listed below?
def clean_budget_string(input_string):
number_to_integer = {'million' : 1000000, 'thousand' : 1000}
budget_parts = input_string.split(' ')
#Currently, only indices 0 and 1 are necessary for computation
text_part = budget_parts[1]
if text_part in number_to_integer:
number = budget_parts[0].lstrip('$')
int_representation = number_to_integer[text_part]
return int(float(number) * int_representation)
else:
number = budget_parts[0]
idx_dollar = 0
for idx in xrange(len(number)):
if number[idx] == '$':
idx_dollar = idx
return int(number[idx_dollar+1:].replace(',', ''))
The way I would approach a parsing task like this -- and I'm happy to hear other opinions -- would be to break up your function into several parts, each of which identify a single piece of information in the input string.
For instance, I'd start by identifying what float number can be parsed from the string, ignoring currency and order of magnitude (a million, a thousand) for now :
f = float(''.join([c for c in input_str if c in '0123456789.']))
(you might want to add error handling for when you end up with a trailing dot, because of additions like 'est.')
Then, in a second step, you determine whether the float needs to be multiplied to adjust for the correct order of magnitude. One way of doing this would be with multiple if-statements :
if 'million' in input_str :
oom = 6
elif 'thousand' in input_str :
oom = 3
else :
oom = 1
# adjust number for order of magnitude
f = f*math.pow(10, oom)
Those checks could of course be improved to account for small differences in formatting by using regular expressions.
Finally, you separately determine the currency mentioned in your input string, again using one or more if-statements :
if '£' in input_str :
currency = 'GBP'
else :
currency = 'USD'
Now the one case that this doesn't yet handle is the dash one where lower and upper estimates are given. One way of making the function work with these inputs is to split the initial input string on the dash and use the first (or second) of the substrings as input for the initial float parsing. So we would replace our first line of code with something like this:
if '-' in input_str :
lower = input_str.split('-')[0]
f = float(''.join([c for c in lower if c in '0123456789.']))
else :
f = float(''.join([c for c in input_str if c in '0123456789.']))
using regex and string replace method, i added the return of the curency as well if needed.
Modify accordingly to handle more input or multiplier like billion etc.
import re
# take in string and return integer amount and currency
def clean_budget_string(s):
mult_dict = {'million':1000000,'thousand':1000}
tmp = re.search('(^\D*?)\s*((?:\d+\.?,?)+)(?:-\d+)?\s*((?:million|thousand)?)', s).groups()
currency = tmp[0]
mult = tmp[-1]
tmp_int = ''.join(tmp[1:-1]).replace(',', '') # join digits and multiplier, remove comma
tmp_int = int(float(tmp_int) * mult_dict.get(mult, 1))
return tmp_int, currency
>>? clean_budget_string("$1.2 million")
(1200000, '$')
>>? clean_budget_string("$1,433,333")
(1433333, '$')
>>? clean_budget_string("US$ 2 million")
(2000000, 'US$')
>>? clean_budget_string("US$1,644,736 (est.)")
(1644736, 'US$')
>>? clean_budget_string("$6-7 million")
(6000000, '$')
>>? clean_budget_string("£3 million")
(3000000, '£') # my script don't recognize the £ char, might need to set the encoding properly
I have a (large) list of parsed sentences (which were parsed using the Stanford parser), for example, the sentence "Now you can be entertained" has the following tree:
(ROOT
(S
(ADVP (RB Now))
(, ,)
(NP (PRP you))
(VP (MD can)
(VP (VB be)
(VP (VBN entertained))))
(. .)))
I am using the set of sentence trees to induce a grammar using nltk:
import nltk
# ... for each sentence tree t, add its production to allProductions
allProductions += t.productions()
# Induce the grammar
S = nltk.Nonterminal('S')
grammar = nltk.induce_pcfg(S, allProductions)
Now I would like to use grammar to generate new, random sentences. My hope is that since the grammar was learned from a specific set of input examples, then the generated sentences will be semantically similar. Can I do this in nltk?
If I can't use nltk to do this, do any other tools exist that can take the (possibly reformatted) grammar and generate sentences?
In NLTK 2.0 you can use nltk.parse.generate to generate all possible sentences for a given grammar.
This code defines a function which should generate a single sentence based on the production rules in a (P)CFG.
# This example uses choice to choose from possible expansions
from random import choice
# This function is based on _generate_all() in nltk.parse.generate
# It therefore assumes the same import environment otherwise.
def generate_sample(grammar, items=["S"]):
frags = []
if len(items) == 1:
if isinstance(items[0], Nonterminal):
for prod in grammar.productions(lhs=items[0]):
frags.append(generate_sample(grammar, prod.rhs()))
else:
frags.append(items[0])
else:
# This is where we need to make our changes
chosen_expansion = choice(items)
frags.append(generate_sample,chosen_expansion)
return frags
To make use of the weights in your PCFG, you'll obviously want to use a better sampling method than choice(), which implicitly assumes all expansions of the current node are equiprobable.
First of all, if you generate random sentences, they may be semantically correct, but they will probably lose their sense.
(It sounds to me a bit like those MIT students did with their SCIgen program which is auto-generating scientific paper. Very interesting btw.)
Anyway, I never did it myself, but it seems possible with nltk.bigrams, you may way to have a look there under Generating Random Text with Bigrams.
You can also generate all subtrees of a current tree, I'm not sure if it is what you want either.
My solution to generate a random sentence from an existing nltk.CFG grammar:
def generate_sample(grammar, prod, frags):
if prod in grammar._lhs_index: # Derivation
derivations = grammar._lhs_index[prod]
derivation = random.choice(derivations)
for d in derivation._rhs:
generate_sample(grammar, d, frags)
elif prod in grammar._rhs_index:
# terminal
frags.append(str(prod))
And now it can be used:
frags = []
generate_sample(grammar, grammar.start(), frags)
print( ' '.join(frags) )
With an nltk Text object you can call 'generate()' on it which will "Print random text, generated using a trigram language model."http://nltk.org/_modules/nltk/text.html
Inspired by the above, here's one which uses iteration instead of recursion.
import random
def rewrite_at(index, replacements, the_list):
del the_list[index]
the_list[index:index] = replacements
def generate_sentence(grammar):
sentence_list = [grammar.start()]
all_terminals = False
while not all_terminals:
all_terminals = True
for position, symbol in enumerate(sentence_list):
if symbol in grammar._lhs_index:
all_terminals = False
derivations = grammar._lhs_index[symbol]
derivation = random.choice(derivations) # or weighted_choice(derivations) if you have a function for that
rewrite_at(position, derivation.rhs(), sentence_list)
return sentence_list
Or if you want the tree of the derivation, this one.
from nltk.tree import Tree
def tree_from_production(production):
return Tree(production.lhs(), production.rhs())
def leaf_positions(the_tree):
return [the_tree.leaf_treeposition(i) for i in range(len(the_tree.leaves()))]
def generate_tree(grammar):
initial_derivations = grammar._lhs_index[grammar.start()]
initial_derivation = random.choice(initial_derivations) # or weighed_choice if you have that function
running_tree = tree_from_production(initial_derivation)
all_terminals = False
while not all_terminals:
all_terminals = True
for position in leaf_positions(running_tree):
node_label = running_tree[position]
if node_label in grammar._lhs_index:
all_terminals = False
derivations = grammar._lhs_index[node_label]
derivation = random.choice(derivations) # or weighed_choice if you have that function
running_tree[position] = tree_from_production(derivation)
return running_tree
Here's a weighted_choice function for NLTK PCFG production rules to use with the above, adapted from Ned Batchelder's answer here for weighted choice functions in general:
def weighted_choice(productions):
prods_with_probs = [(prod, prod.prob()) for prod in productions]
total = sum(prob for prod, prob in prods_with_probs)
r = random.uniform(0, total)
upto = 0
for prod, prob in prods_with_probs:
if upto + prob >= r:
return prod
upto += prob
assert False, "Shouldn't get here"
Given a gettext Plural-Forms line, general a few example values for each n. I'd like this feature for the web interface for my site's translators, so that they know which plural form to put where. For example, given:
"Plural-Forms: nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%"
"10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;\n"
... I want the first text field to be labeled "1, 21..", then "2, 3, 4...", then "5, 6..." (not sure if this is exactly right, but you get the idea.)
Right now the best thing I can come up with is to parse the expression somehow, then iterate x from 0 to 100 and see what n it produces. This isn't guaranteed to work (what if the lowest x is over 100 for some language?) but it's probably good enough. Any better ideas or existing Python code?
Given that it's late, I'll bite.
The following solution is hacky, and relies on converting your plural form to python code that can be evaluated (basically converting the x ? y : z statements to the python x and y or z equivalent, and changing &&/|| to and/or)
I'm not sure if your plural form rule is a contrived example, and I don't understand what you mean with your first text field, but I'm sure you'll get where I'm going with my example solution:
# -*- Mode: Python -*-
# vi:si:et:sw=4:sts=4:ts=4
p = "Plural-Forms: nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;\n"
# extract rule
import re
matcher = re.compile('plural=(.*);')
match = matcher.search(p)
rule = match.expand("\\1")
# convert rule to python syntax
oldrule = None
while oldrule != rule:
oldrule = rule
rule = re.sub('(.*)\?(.*):(.*)', r'(\1) and (\2) or (\3)', oldrule)
rule = re.sub('&&', 'and', rule)
rule = re.sub('\|\|', 'or', rule)
for n in range(40):
code = "n = %d" % n
print n, eval(rule)