How can I split a string into tokens? - python

If I have a string
'x+13.5*10x-4e1'
how can I split it into the following list of tokens?
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
Currently I'm using the shlex module:
import shlex
s = 'x+13.5*10x-4e1'  # not named str, which would shadow the built-in used below
lexer = shlex.shlex(s)
tokenList = []
for token in lexer:
    tokenList.append(str(token))
print(tokenList)
But this prints:
['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']
So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers then somehow splitting them, but not sure about how to do this or how to add them all back into the list with the others afterwards. It's important that the tokens stay in order, and I can't have nested lists.
In an ideal world, e and E would not be recognised as letters in the same way, so
'-4e1'
would become
['-', '4e1']
but
'-4x1'
would become
['-', '4', 'x', '1']
Can anybody help?

Use the split() function of the regular expression module to split at
'\d+' -- digits (number characters) and
'\W+' -- non-word characters:
CODE:
import re
print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
If you don't want to separate the dot (so that a floating-point number in the expression stays together), use this instead:
[\d.]+ -- digit or dot characters (although this also accepts malformed numbers such as 13.5.5)
CODE:
print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
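The question's "ideal world" case (keep 4e1 together but split 4x1) is not covered by either split above. A minimal sketch using re.findall, assuming an exponent is always digits, then e or E, then digits; the names TOKEN_RE and tokenize are made up for this example:

```python
import re

# Alternatives are tried left to right, so the exponent form must come first.
TOKEN_RE = re.compile(r"""
    \d+[eE]\d+     # digits, e or E, digits: kept as one token, e.g. 4e1
  | \d+            # a plain run of digits
  | [A-Za-z]+      # a run of letters
  | \S             # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    """Return the tokens of text in order, as a flat list of strings."""
    return TOKEN_RE.findall(text)

print(tokenize('-4e1'))            # ['-', '4e1']
print(tokenize('-4x1'))            # ['-', '4', 'x', '1']
print(tokenize('x+13.5*10x-4e1'))  # ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4e1']
```

Because findall only returns what matched, there are no empty strings to filter out afterwards.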

Another alternative not suggested here is to use the nltk.tokenize module.

The problem is not quite as simple as it looks. I think a good way to get a robust (though unfortunately not so short) solution is to use Python Lex-Yacc (PLY) to build a full tokenizer. Lex/Yacc is a common approach for this, not only in Python, so ready-made grammars for a simple arithmetic tokenizer may already exist, and you only have to adapt them to your specific needs.
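This is not PLY itself, but the same one-rule-per-token-kind idea can be sketched with only the standard re module. The token kinds and the names TOKEN_SPEC, MASTER_RE and lex are assumptions for illustration; unlike the question, this sketch treats 13.5 and 4e1 as single numbers:

```python
import re

# Hypothetical token kinds; order matters because earlier alternatives win.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"),  # 13, 13.5, 4e1
    ("NAME",   r"[A-Za-z_]+"),                      # x, foo
    ("OP",     r"[-+*/^()]"),                       # single-character operators
    ("SKIP",   r"\s+"),                             # whitespace, discarded
    ("ERROR",  r"."),                               # anything else
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(text):
    """Yield (kind, text) pairs in order; raise ValueError on unknown characters."""
    for match in MASTER_RE.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError(f"unexpected character {match.group()!r} at position {match.start()}")
        yield kind, match.group()

print(list(lex("x+13.5*10x-4e1")))
# [('NAME', 'x'), ('OP', '+'), ('NUMBER', '13.5'), ('OP', '*'),
#  ('NUMBER', '10'), ('NAME', 'x'), ('OP', '-'), ('NUMBER', '4e1')]
```

Tagging each token with its kind is what a later parsing stage would rely on, which is the main thing a lexer adds over a plain split.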

Related

Split a String with multiple delimiter and get the used delimiter

I want to split a String in python using multiple delimiter. In my case I also want the delimiter which was used returned in a list of delimiters.
Example:
string = '1000+20-12+123-165-564'
(Methods which split the string and return lists with numbers and delimiter)
numbers = ['1000', '20', '12', '123', '165', '564']
delimiter = ['+', '-', '+', '-', '-']
I hope my question is understandable.
You might use re.split for this task in the following way:
import re
string = '1000+20-12+123-165-564'
elements = re.split(r'(\d+)',string) # note capturing group
print(elements) # ['', '1000', '+', '20', '-', '12', '+', '123', '-', '165', '-', '564', '']
numbers = elements[1::2] # last 2 is step, get every 2nd element, start at index 1
delimiter = elements[2:-1:2] # every 2nd element from index 2, stopping before the trailing empty string
print(numbers) # ['1000', '20', '12', '123', '165', '564']
print(delimiter) # ['+', '-', '+', '-', '-']
Just capture (...) the delimiter along with matching/splitting with re.split:
import re
s = '1000+20-12+123-165-564'
parts = re.split(r'([+-])', s)
numbers, delims = parts[::2], parts[1::2]
print(numbers, delims)
['1000', '20', '12', '123', '165', '564'] ['+', '-', '+', '-', '-']
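If the two lists are all that is needed, a sketch with two re.findall calls avoids slicing and empty strings altogether. It assumes unsigned integers joined only by + or -; the helper name split_numbers_and_delims is made up for this example:

```python
import re

def split_numbers_and_delims(s):
    # Hypothetical helper; assumes unsigned integers joined by + or -.
    numbers = re.findall(r'\d+', s)   # every run of digits, in order
    delims = re.findall(r'[+-]', s)   # every + or -, in order
    return numbers, delims

numbers, delims = split_numbers_and_delims('1000+20-12+123-165-564')
print(numbers)  # ['1000', '20', '12', '123', '165', '564']
print(delims)   # ['+', '-', '+', '-', '-']
```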

Python - Splitting a string by special characters and numbers

I have a string that I want to split at every instance of an integer, unless an integer is directly followed by another integer. I then want to split that same string at "(" and ")".
import re
myStr = "H12(O1H2)2O2C1"
list1 = re.split(r'(\d+)', myStr)
print(list1)
list1 = re.split(r'(\W)', myStr)
print(list1)
I want the result to be ['H', '12', '(', 'O', '1', 'H', '2', ')', '2', 'O', '2', 'C', '1'].
After:
re.split('(\d+)', myStr)
I get:
['H', '12', '(O', '1', 'H', '2', ')', '2', 'O', '2', 'C', '1']
I now want to split up the open parenthesis and the "O" to make individual elements.
Trying to split up a list after it's already been split up the way I tried doesn't work.
Also, "myStr" eventually will be a user input, so I don't think that indexing through a known string (like myStr is in this example) would solve my issue.
Open to suggestions.
You have to add a character set for the parentheses to get what you want: change (\d+) to something like (\d+|[()])
import re
myStr = "H12(O1H2)2O2C1"
list1 = re.split(r'(\d+|[()])', myStr)
# print(list1)
noempty_list = list(filter(None, list1))
print(noempty_list)
Output:
['H', '12', '(', 'O', '1', 'H', '2', ')', '2', 'O', '2', 'C', '1']
The pattern also has to match the ( and ) characters; without that, the output contains '(O' as a single element. re.split also returns empty strings wherever two delimiters are adjacent or a delimiter ends the string, so filter them out.
(\d+|[A-Z]) works too, but re.split then returns more empty strings in the list.
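An alternative sketch that matches the tokens directly with re.findall, so no empty strings appear. It assumes a letter token is one uppercase letter followed by optional lowercase letters (as in element symbols), which goes beyond what the question states; the name formula_tokens is made up:

```python
import re

def formula_tokens(text):
    # Assumed token shapes: element-like symbol, run of digits, or a parenthesis.
    return re.findall(r'[A-Z][a-z]*|\d+|[()]', text)

print(formula_tokens("H12(O1H2)2O2C1"))
# ['H', '12', '(', 'O', '1', 'H', '2', ')', '2', 'O', '2', 'C', '1']
```

Characters that fit none of the three alternatives are silently skipped, so user input may need validating separately.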

Convert a string that includes both characters and integers to a list in Python

I know how to convert for instance:
'1-2=3^4/5' -> [1, '-', 2, '=', 3, '^', 4, '/', 5]
but if let's say I want to convert:
'12-34=56^78/90' -> [12, '-', 34, '=', 56, '^', 78, '/', 90]
Then I have issues.
I tried several things and it never worked perfectly: it either had an edge case where it was not working or there were other issues. For instance, one of the problems I had was that the digits after the first one of an int were repeated as new elements.
I would greatly appreciate if anyone can take some time to help me.
Thx in advance!
EDIT: Thx to everyone for your quick answers! However, I am kinda new to programming and hence not familiar w/ the modules or methods used.
Would it be possible to do it using only built-in functions?
A simple pattern that selects either a run of digits or a single non-digit will do it:
import re
pat = re.compile(r"\d+|\D")
parts = pat.findall("1-2=3^4/5")
print(parts) # ['1', '-', '2', '=', '3', '^', '4', '/', '5']
parts = pat.findall("12-34=56^78/90")
print(parts) # ['12', '-', '34', '=', '56', '^', '78', '/', '90']
Use itertools.groupby to group by consecutive digits (using str.isdigit)
from itertools import groupby
s = '12-34=56^78/90'
res = ["".join(group) for k, group in groupby(s, key=str.isdigit)]
print(res)
Output
['12', '-', '34', '=', '56', '^', '78', '/', '90']
The other two answers are way, way better. But if you feel compelled to do it without any imports, here is a solution.
s = '12-34=56^78/90'
output = []
section = []
for e in s:
    try:
        e = int(e)
        section.append(e)
    except ValueError:
        output.append(''.join(map(str,section)))
        output.append(e)
        section = []
output.append(''.join(map(str,section)))
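The target list in the question holds integers rather than digit strings. A sketch that uses only built-ins and produces that form, assuming plain ASCII digits and single-character operators; the function name split_ints is made up:

```python
def split_ints(text):
    tokens = []
    digits = ""                      # digits of the number currently being read
    for ch in text:
        if ch in "0123456789":
            digits += ch
        else:
            if digits:               # a number just ended: store it as an int
                tokens.append(int(digits))
                digits = ""
            tokens.append(ch)        # the operator itself stays a string
    if digits:                       # number at the very end of the string
        tokens.append(int(digits))
    return tokens

print(split_ints('12-34=56^78/90'))  # [12, '-', 34, '=', 56, '^', 78, '/', 90]
print(split_ints('1-2=3^4/5'))       # [1, '-', 2, '=', 3, '^', 4, '/', 5]
```

Checking that the buffer is non-empty before appending also avoids the empty string that would otherwise appear when the input starts with an operator.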

How to split a string that includes sign characters

How can I split a string that includes "sign characters" but no spaces? For example:
aString = '1+20*40-3'
I want the output to be:
['1', '+', '20', '*', '40', '-', '3']
I tried this:
aString.split('+' and '*' and '-')
but that didn't work.
You can use a regular expression for this task in Python. The code will be:
import re
aString = '1+20*40-3'
print(re.findall(r'[-+/*]|\d+', aString))  # hyphen placed first so it is literal, not a range
Output:
['1', '+', '20', '*', '40', '-', '3']
See the documentation for the re module.
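One detail worth checking in character classes for operators: a hyphen between two characters denotes a range, so [+-/*] covers every character from + to / (which includes , and .) in addition to *. A short sketch of the difference, with made-up sample input:

```python
import re

ranged = re.compile(r'[+-/*]')    # '+-/' is a range covering + , - . /
literal = re.compile(r'[-+/*]')   # a leading hyphen is a literal hyphen

print(ranged.findall('1.5+2,3'))   # ['.', '+', ',']
print(literal.findall('1.5+2,3'))  # ['+']
```

Putting the hyphen first or last in the class, or escaping it as \-, keeps it literal.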

Splitting a math expression string into tokens in Python

I have a lot of python strings such as "A7*4", "Z3+8", "B6 / 11", and I want to split these strings so that they would be in a list, in the format ["A7", "*", "4"], ["B6", "/", "11"], etc. I have used a lot of different split methods but I think I need to just perform the split where there is a math symbol, such as /,*,+,-. I would also need to strip out the whitespace.
Currently I am using the code re.split(r'(\D)', "B6 / 11"), which is returning ['', 'B', '6', ' ', '', '/', '', ' ', '11']. Instead I want to get back ["B6", "/", "11"].
You should split on the character set [-+/*] after removing the whitespace from the string (the hyphen comes first so that it is a literal rather than part of a range):
>>> import re
>>> def mysplit(mystr):
...     return re.split(r"([-+/*])", mystr.replace(" ", ""))
...
>>> mysplit("A7*4")
['A7', '*', '4']
>>> mysplit("Z3+8")
['Z3', '+', '8']
>>> mysplit("B6 / 11")
['B6', '/', '11']
>>>
There is a way to solve this without regular expressions by using the Python tokenizer. I used a more complex formula to show the capabilities of this solution.
from io import StringIO
import tokenize
formula = "(A7*4) - (Z3+8) - ( B6 / 11)"
print([token[1] for token in tokenize.generate_tokens(StringIO(formula).readline) if token[1]])
Result:
['(', 'A7', '*', '4', ')', '-', '(', 'Z3', '+', '8', ')', '-', '(', 'B6', '/', '11', ')']
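The tokenizer also reports a type for each token, which can replace the truthiness filter above. A sketch that pairs each token with its type name; the function name python_tokens is made up, and since Python's own lexical rules apply, input that is not valid Python syntax may be split differently or rejected:

```python
import io
import tokenize

def python_tokens(formula):
    # Bookkeeping tokens that carry no text from the formula itself.
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    readline = io.StringIO(formula).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)
            if tok.type not in skip]

print(python_tokens("B6 / 11"))
# [('NAME', 'B6'), ('OP', '/'), ('NUMBER', '11')]
```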
