Preprocessing LaTeX-formatted math questions in Python

I am coding a quantitative problem-solving AI and want to train it on the MATH dataset, which contains math questions and answers like this:
{
"problem": "Let \\[f(x) = \\left\\{\n\\begin{array}{cl} ax+3, &\\text{ if }x>2, \\\\\nx-5 &\\text{ if } -2 \\le x \\le 2, \\\\\n2x-b &\\text{ if } x <-2.\n\\end{array}\n\\right.\\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper).",
"level": "Level 5",
"type": "Algebra",
"solution": "For the piecewise function to be continuous, the cases must \"meet\" at $2$ and $-2$. For example, $ax+3$ and $x-5$ must be equal when $x=2$. This implies $a(2)+3=2-5$, which we solve to get $2a=-6 \\Rightarrow a=-3$. Similarly, $x-5$ and $2x-b$ must be equal when $x=-2$. Substituting, we get $-2-5=2(-2)-b$, which implies $b=3$. So $a+b=-3+3=\\boxed{0}$."
}
My question is: how could you preprocess this into something a language model would understand? My first approach was to encode each unique character as a different integer, but how could one do this in Python so that a string like "\ \ " is encoded as two backslashes and not one? More generally, how can I skip all the escape processing that Python applies automatically when a string literal is created, and get every single character exactly as it is?
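One observation that may help: escape sequences exist only in Python source-code literals, not in strings read from a file. If you load a record with json.loads, the resulting string already contains every backslash as a real character, so nothing is lost. A minimal sketch of character-level integer encoding on top of that (the record below is a made-up stand-in for a real MATH file):
import json

# Toy stand-in for one MATH record; in practice you would read the JSON from a file,
# which sidesteps Python escape processing entirely.
raw = r'{"problem": "Let \\[f(x) = ax+3\\] Find $a+b$."}'
record = json.loads(raw)
text = record["problem"]  # contains literal backslashes: Let \[f(x) = ax+3\] Find $a+b$.

vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}  # one integer per unique character
encoded = [vocab[ch] for ch in text]
inverse = {i: ch for ch, i in vocab.items()}
assert "".join(inverse[i] for i in encoded) == text  # the encoding round-trips losslessly
In practice the vocabulary would be built over the whole dataset rather than one string, but the escape question disappears either way.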

Related

Insert a string within another string with probability based on location

I'm working in Python and I would like to insert a string within another string at a random location, but I want the random location to be drawn from a probability distribution that favors certain locations over others. Specifically, I want to insert more strings towards the beginning of the original string and fewer towards the end.
For example, if the insertion string is "I go here" and the original string is "this is a test string and it can be long.", I want to insert the insertion string at a random location in the original string. But if I do it, say, 100 times, I would want "I go here this is a test string and it can be long" to occur more often than "this is a test string and it can be long. I go here". I want to be able to tune the probability distribution.
Any help is appreciated.
You can use the random.gauss() function.
It returns a random number drawn from a Gaussian distribution.
The function takes two parameters: mean and sigma. mean is the expected mean of the outputs, and sigma is the standard deviation, i.e. how far the values tend to spread from the mean.
Try something like this:
import random

original = "this is a test string and it can be long."
insertion = "I go here "
mean = 0
sigma = len(original)  # just an example; a smaller sigma concentrates insertions near the start
r = abs(random.gauss(mean, sigma))  # fold negative draws onto the positive side
insertion_index = min(int(r), len(original))  # clamp so the index stays inside the string
result = original[:insertion_index] + insertion + original[insertion_index:]
Note that this code is not perfect, but the general idea should work.
I recommend reading more about the Gaussian distribution.

How to handle categorical independent variables in sklearn decision trees

I converted all my categorical independent variables from strings to numeric (binary 1's and 0's) using OneHotEncoder, but when I run a decision tree the algorithm treats the binary categorical variables as continuous.
For example, if gender is one of my independent variables and I converted male to 1 and female to 0, the decision tree splits the node at 0.5, which makes no sense to me.
How can I convert this continuous numeric variable to a categorical one?
How can I convert this continuous numeric variable to a categorical one?
If the result is the same, would you need it?
For example, if gender is one of my independent variables and I converted male to 1 and female to 0, the decision tree splits the node at 0.5, which makes no sense to me.
Maybe I am wrong, but this split makes sense to me.
Let's say we have a decision tree with a split rule that is categorical.
The division would be a binary division, meaning "0" is left and "1" is right (in this case).
Now, how can we optimize this division rule? Instead of checking whether a value is "0" or "1", we can use a single comparison to replace the two checks: "0" goes left and everything else goes right. We can then replace this categorical check with a float threshold: < 0.5 goes left, everything else goes right.
In code, it would be as simple as:
Case 1:
if value == "0":
    tree.left()
elif value == "1":
    tree.right()
else:
    pass  # if you work with binary values, this branch can never be reached, so it's useless
Case 2:
if value == "0":
    tree.left()
else:
    tree.right()
Case 3:
if value < 0.5:
    tree.left()
else:
    tree.right()
There are basically two ways to deal with this. You can use:
1. Integer encoding (if the categorical variable is ordinal in nature, like size)
2. One-hot encoding (if the categorical variable is nominal in nature, i.e. has no intrinsic order, like gender)
It seems you have implemented one-hot encoding incorrectly for this problem. What you are using is simple integer encoding (or binary encoding, to be more specific). Correctly implemented one-hot encoding ensures there is no ordering bias in the converted values, so the results of the machine learning algorithm are not swayed in favour of a variable just because of its sheer numeric value.
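For illustration, a minimal sketch of one-hot encoding with scikit-learn (the toy gender column is made up, and the sparse_output parameter assumes scikit-learn >= 1.2):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["male"], ["female"], ["female"], ["male"]])
encoder = OneHotEncoder(sparse_output=False)  # on older versions, use sparse=False instead
X_encoded = encoder.fit_transform(X)
print(encoder.categories_)  # [array(['female', 'male'], dtype='<U6')]
print(X_encoded)  # one column per category: rows like [0. 1.] for male, [1. 0.] for female
That said, for a strictly binary variable the 0.5 split the tree finds is equivalent to the categorical split; one-hot encoding matters most once a variable has three or more unordered categories.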

Solve an equation (given as a string) with Python for every symbol

I have to solve an equation in Python, which I get as a string input. I don't know how many symbols are in the equation or what their names are. A typical symbol could be "mm", "cm", "x" or something like this. The function should return some kind of array/JSON with the solved equation.
A little example of how it should look:
solve("x/2=4")
>> ["x=8"]
solve("x + 2 = y - 1")
>> ["x=y-3", "y=x+3"]
I tried to use the SymPy module for this, but I didn't find a way to feed it a dynamic string like the ones above. SymPy seems to only accept "hardcoded" Symbols.
Note: the string comes from a sys.argv parameter.
SymPy can parse strings with sympify, but its format for equations is Eq(x/2, 4) instead of x/2 = 4. So a little preprocessing is necessary: surround the string with Eq( ) and replace "=" with a comma.
eq = "x/2=4"
sympy_eq = sympify("Eq(" + eq.replace("=", ",") + ")")
solve(sympy_eq) # [8]
and
eq = "x + 2 = y - 1"
sympy_eq = sympify("Eq(" + eq.replace("=", ",") + ")")
solve(sympy_eq) # [{x: y - 3}]
In the latter case, SymPy picked one of the variables to solve for. To choose which one it should be, you can provide a Symbol:
solve(sympy_eq, Symbol('y')) # [x + 3]
Or, to solve for every symbol:
[solve(sympy_eq, sym, dict=True) for sym in sympy_eq.free_symbols]
returns [[{y: x + 3}], [{x: y - 3}]]. The list is nested because multiple solutions could appear for each symbol. Flatten the nested list if necessary.
The options list=True and dict=True of solve are convenient for enforcing particular forms of output.
The answer most likely has two different parts.
Parsing:
Parsing means turning some input into usable output; in your case the input is a string and the output is something SymPy can work with. A simple parsing step, for example, is turning strings into integers with int(your_string). In your case, you could iterate through your string and find variables, units, etc., for example by comparing against a dictionary or a list of strings. Parsing arbitrary input is quite hard, so the best idea is to start with a small set of options, e.g. search the string for occurrences of typical variable names like x, y and z by comparing against a list variables=['x','y','z'].
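A minimal sketch of that last idea (the variable list and the regular expression are assumptions, not a general parser):
import re

variables = ['x', 'y', 'z']
eq = "x + 2 = y - 1"
# Word boundaries avoid false hits inside longer names such as "max".
found = [v for v in variables if re.search(r'\b' + re.escape(v) + r'\b', eq)]
print(found)  # ['x', 'y']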
Computing:
Once the parsing is clean, simply plug everything into the number crunching / solvers provided by SymPy.
To see how such a system can work when done correctly, have a look at Wolfram Alpha. They do quite good parsing / natural-language processing and try to guess what to do from there.

Difflib SequenceMatcher - how to count 'equal' only for matches longer than one char?

I wrote a Python module that diffs two HTML source codes.
I have a little problem comparing the text: the difflib.SequenceMatcher function classifies text as 'equal' even if only one character is similar.
So generated values like "123456" and "abc1de" will be categorized as: "abc" inserted, "1" equal to "1", and "23456" replaced by "de".
In conclusion, how can I ensure that the 'equal' classification is only assigned when the matching run is more than 3 characters long?
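One possible approach (a sketch, not a drop-in fix) is to post-process the matcher's opcodes rather than change the matcher itself: get_opcodes() reports every run with its indices, so short 'equal' runs can be demoted to 'replace':
import difflib

a, b = "123456", "abc1de"
MIN_EQUAL = 3  # runs of this length or shorter stop counting as 'equal'

sm = difflib.SequenceMatcher(None, a, b)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == "equal" and (i2 - i1) <= MIN_EQUAL:
        tag = "replace"  # demote short matches
    print(tag, repr(a[i1:i2]), "->", repr(b[j1:j2]))
Note that neighbouring demoted runs are not merged here; doing so would need one extra pass over the opcode list.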

python parallel computing: split keyspace to give each node a range to work on

My question is rather complicated for me to explain, as I'm not really good at maths, but I'll try to be as clear as possible.
I'm trying to code a cluster in Python which will generate words from a charset (i.e. with lowercase letters: aaaa, aaab, aaac, ..., zzzz) and perform various operations on them.
I'm trying to work out how to calculate, given the charset and the number of nodes, what range each node should work on (e.g. node1: aaaa-azzz, node2: baaa-czzz, node3: daaa-ezzz, ...). Is it possible to design an algorithm that computes this, and if so, how could I implement it in Python?
I really don't know how to do that, so any help would be much appreciated.
Any way of computing a small integer from the string would be fine for clustering. For example, compute an md5 hash and look at one byte of it:
import hashlib

s = "aaac"
num_nodes = 5  # or whatever
m = hashlib.md5(s.encode("utf-8"))  # md5 operates on bytes, so encode the string first
node = m.digest()[0] % num_nodes  # first byte of the digest, modulo the node count
print(node)  # prints 2
This won't guarantee to evenly distribute all the strings, but it will be close.
You should be able to treat your words as numerals in an unusual base. For example, say you have a..z as your charset (26 characters), 4-character strings, and you want to distribute the work equally among 10 machines. Then there are 26^4 strings in total, so each machine gets 26^4/10 of them. The first machine gets strings 0 through 26^4/10, the next gets 26^4/10 through 2*26^4/10, and so on.
To convert the numbers back to strings, just write the number in base 26 using your charset as the digits. So 0 is 'aaaa', and 26^4/10 = 2*26^3 + 15*26^2 + 15*26 + 15 is 'cppp'.
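A minimal sketch of that base conversion (identifiers and the chunking scheme are illustrative):
import string

CHARSET = string.ascii_lowercase  # a..z
BASE = len(CHARSET)  # 26
WORD_LEN = 4

def number_to_word(n, length=WORD_LEN):
    # Write n in base 26, using the charset as digits, padded to `length` characters.
    digits = []
    for _ in range(length):
        n, rem = divmod(n, BASE)
        digits.append(CHARSET[rem])
    return "".join(reversed(digits))

num_machines = 10
total = BASE ** WORD_LEN  # 456976 strings in all
chunk = total // num_machines
ranges = []
for i in range(num_machines):
    start = i * chunk
    end = (i + 1) * chunk - 1 if i < num_machines - 1 else total - 1  # last machine takes the remainder
    ranges.append((number_to_word(start), number_to_word(end)))

print(number_to_word(chunk))  # 'cppp', matching the arithmetic above
print(ranges[0])  # the first machine's (start, end) pair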
