Splitting a string by special characters and numbers

I have a string that I want to split at every instance of an integer, unless an integer is directly followed by another integer. I then want to split that same string at "(" and ")".
myStr = ("H12(O1H2)2O2C1")
list1 = re.split('(\d+)', myStr)
print(list1)
list1 = re.split('(\W)', myStr)
print(list1)
I want the result to be ['H', '12', '(', 'O', '1', 'H', '2', ')', '2', 'O', '2', 'C', '1'].
After:
re.split(r'(\d+)', myStr)
I get:
['H', '12', '(O', '1', 'H', '2', ')', '2', 'O', '2', 'C', '1', '']
I now want to split the open parenthesis and the "O" into individual elements.
Splitting the list again after it has already been split, the way I tried, doesn't work.
Also, myStr will eventually be user input, so I don't think that indexing into a known string (like myStr in this example) would solve my issue.
Open to suggestions.

You have to use a character set to get what you want; change (\d+) to something like ([\d]+|[\(\)]):
import re

myStr = "H12(O1H2)2O2C1"
list1 = re.split(r'([\d]+|[\(\)])', myStr)
# print(list1)
noempty_list = list(filter(None, list1))
print(noempty_list)
Output:
['H', '12', '(', 'O', '1', 'H', '2', ')', '2', 'O', '2', 'C', '1']
You also have to match the ( and ) characters; without that, the split leaves '(O' together as one element. And since re.split returns a list containing empty strings, just remove them.
A pattern like ([\d]+|[A-Z]) will work too, but re.split will return more empty strings in the list.
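For comparison, here is a minimal sketch of that alternative pattern (my own example, not from the answer above), filtering the empties the same way:
import re

myStr = "H12(O1H2)2O2C1"
# Splitting on digit runs or single capital letters captures every token,
# but adjacent matches leave empty strings between them.
list2 = re.split(r'([\d]+|[A-Z])', myStr)
print(list2)
# ['', 'H', '', '12', '(', 'O', '', '1', '', 'H', '', '2', ')', '2', '', 'O', '', '2', '', 'C', '', '1', '']
print(list(filter(None, list2)))
# ['H', '12', '(', 'O', '1', 'H', '2', ')', '2', 'O', '2', 'C', '1']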

Related

Sort brackets after alphanumeric characters?

Working in Python 3:
>>> a = ['(', 'z', 'a', '1', '{']
>>> a.sort()
>>> a
['(', '1', 'a', 'z', '{']
How can I sort the list so that alphanumeric characters come before punctuation characters:
>>> a = ['(', 'z', 'a', '1', '{']
>>> a.custom_sort()
>>> a
['1', 'a', 'z', '(', '{']
(Actually I don't care about the order of the last two characters.)
This seems surprisingly difficult!
I understand that Python sorts asciibetically, and I'm looking for a human-readable sort. I found natsort but it only seems to deal with numbers.
You can use a key function for sort that returns a tuple: the first element tests whether a given character is not alphanumeric (so punctuation sorts last), and the character itself serves as the secondary sorting key:
a.sort(key=lambda c: (not c.isalnum(), c))
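Checked against the list from the question:
>>> a = ['(', 'z', 'a', '1', '{']
>>> a.sort(key=lambda c: (not c.isalnum(), c))
>>> a
['1', 'a', 'z', '(', '{']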
You can pass a key function to sorted that checks whether the value is in string.punctuation:
import string
punctuation = set(string.punctuation)
a = sorted(a, key=lambda x: (x in punctuation, x))
print(a)
#['1', 'a', 'z', '(', '{']
This approach explicitly checks whether the character is in the right sets:
import string
import sys
a = ['(', 'z', 'a', '1', '{']
def key(a):
    if a in string.ascii_letters or a in string.digits:
        return ord(a)
    return sys.maxsize
a.sort(key=key)
print(a)

How to separate parts of list item into a new list?

I want to grab the return data from one of my functions and create a new list holding the two values contained in one item of the initial list.
What I'm currently trying: I get a response from my serial port, store the response data in a variable, and then call .split(' ') (assuming this would return a list holding the items that were separated by a space), and it did.
What I'm trying to get is this: gax = ['110.00', '94.00']
My response without the .split(' ') method:
b'[06][1c]ans=33[0d] job=42985[0d] mid=001[0d] status=0;"ok"[0d]do=b[0d] crib=69.80;67.80[0d] gax=110.00;94.00[0d][1e][1d]'
My data with the .split(' ') method:
['[06][1c]ans=33[0d]', 'job=42985[0d]', 'mid=001[0d]',
'status=0;"ok"[0d]do=b[0d]', 'crib=69.80;67.80[0d]',
'gax=110.00;94.00[0d][1e][1d]']
and this is what I get when I tried a list comprehension:
['g', 'a', 'x', '=', '1', '1', '0', '.', '0', '0', ';', '9', '4', '.', '0', '0',
'[', '0', 'd', ']', '[', '1', 'e', ']', '[', '1', 'd', ']']
How can I achieve what I want to do?
Is list comprehension the right way to go?
def get_job_from_serial():
    response_f = serial_client.readline()
    print('job sent from host {}'.format(response_f))
    return response_f

jrfromserial = get_job_from_serial()
j = jrfromserial.decode('utf-8').split(' ')
print('the available list of strings is ------ >> {}'.format(j))
# here I was trying to remove the trailing part in brackets.
# what I got was gax=110.00;94.
pre_gax = j[5].rstrip('[0d][1e][1d]]')
print(pre_gax)
gax = [g for g in j[5]]
print(gax)
The result you're getting is because you're looping over a string value, which yields its individual characters.
To resolve your gax issue you should just be able to do:
gax_nums = pre_gax.lstrip("gax=")
gax = [float(x) for x in gax_nums.split(";")]
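Note that lstrip and rstrip strip character sets, not literal prefixes or suffixes, which is why the rstrip above ate part of 94.00. If you'd rather avoid them, a minimal regex sketch could look like this (my own example, assuming the gax field always carries two ;-separated numbers, as in the response shown):
import re

field = 'gax=110.00;94.00[0d][1e][1d]'

# Capture the two ;-separated numbers that follow "gax=".
match = re.match(r'gax=([\d.]+);([\d.]+)', field)
if match:
    gax = list(match.groups())  # kept as strings: ['110.00', '94.00']
    print(gax)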

Using an 'or' condition in re.split

I have a list of strings, each of which needs to be split after every 'y' or 'm':
mylist = ['3m10y','10y20y','18m2y']
into the following sublists:
splitlist = [['3m','10y'],['10y','20y'],['18m','2y']]
I was thinking of using re.split(), but I cannot work out how to use an 'or' condition to tell the function to split either when it finds an 'm' or a 'y'.
Any help appreciated!
Thanks
Try findall instead of split:
>>> re.findall(r'\d+[ym]', '3m10y')
['3m', '10y']
[ym] matches either 'y' or 'm'.
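Applied to the whole list from the question, this gives the desired splitlist directly:
>>> mylist = ['3m10y', '10y20y', '18m2y']
>>> [re.findall(r'\d+[ym]', s) for s in mylist]
[['3m', '10y'], ['10y', '20y'], ['18m', '2y']]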
>>> items = re.split(r'(m|y)', '10m2y4m55y55y53m')
>>> items
['10', 'm', '2', 'y', '4', 'm', '55', 'y', '55', 'y', '53', 'm', '']
>>> [''.join(p) for p in zip(items[::2], items[1::2])]
['10m', '2y', '4m', '55y', '55y', '53m']
The trailing empty string from the split is dropped automatically: zip stops at the end of the shorter slice, pairing each number with the unit letter that follows it.

Splitting a math expression string into tokens in Python

I have a lot of Python strings such as "A7*4", "Z3+8", "B6 / 11", and I want to split these strings so that they end up in a list, in the format ["A7", "*", "4"], ["B6", "/", "11"], etc. I have tried a lot of different split methods, but I think I just need to split where there is a math symbol, such as /, *, +, -. I would also need to strip out the whitespace.
Currently I am using the code re.split(r'(\D)', "B6 / 11"), which returns ['', 'B', '6', ' ', '', '/', '', ' ', '11']. Instead I want to get back ["B6", "/", "11"].
You should split on the character set [-+*/] after removing the whitespace from the string (note the - is placed first so it is not treated as a range):
>>> import re
>>> def mysplit(mystr):
... return re.split("([+-/*])", mystr.replace(" ", ""))
...
>>> mysplit("A7*4")
['A7', '*', '4']
>>> mysplit("Z3+8")
['Z3', '+', '8']
>>> mysplit("B6 / 11")
['B6', '/', '11']
There is a way to solve this without regular expressions by using the Python tokenizer. I used a more complex formula to show the capabilities of this solution.
from io import StringIO
import tokenize
formula = "(A7*4) - (Z3+8) - ( B6 / 11)"
print([token[1] for token in tokenize.generate_tokens(StringIO(formula).readline) if token[1]])
Result:
['(', 'A7', '*', '4', ')', '-', '(', 'Z3', '+', '8', ')', '-', '(', 'B6', '/', '11', ')']

How can I split a string into tokens?

If I have a string
'x+13.5*10x-4e1'
how can I split it into the following list of tokens?
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
Currently I'm using the shlex module:
import shlex

expr = 'x+13.5*10x-4e1'
lexer = shlex.shlex(expr)
tokenList = []
for token in lexer:
    tokenList.append(str(token))
print(tokenList)
But this returns:
['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']
So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers, splitting them somehow, and then adding them back into the list with the others afterwards, but I'm not sure how to do this. It's important that the tokens stay in order, and I can't have nested lists.
In an ideal world, e and E would not be recognised as letters in the same way, so
'-4e1'
would become
['-', '4e1']
but
'-4x1'
would become
['-', '4', 'x', '1']
Can anybody help?
Use the regular expression module's split() function to split at
'\d+' -- digits (number characters) and
'\W+' -- non-word characters:
CODE:
import re
print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
If you don't want to separate the dot (so that floating-point numbers in the expression stay whole), then you should use this:
[\d.]+ -- digit or dot characters (although this also allows malformed input like 13.5.5):
CODE:
print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
Another alternative not suggested here is to use the nltk.tokenize module.
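A quick sketch of what that might look like (my own example; requires pip install nltk, and note it does not split letter/digit runs on its own):
from nltk.tokenize import wordpunct_tokenize

# wordpunct_tokenize splits on word/punctuation boundaries (\w+|[^\w\s]+),
# so alphanumeric runs such as '10x' and '4e1' stay together.
print(wordpunct_tokenize('x+13.5*10x-4e1'))
# ['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']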
Well, the problem is not quite as simple as it seems. I think a good way to get a robust (but, unfortunately, not so short) solution is to use Python Lex-Yacc (PLY) to create a full-weight tokenizer. Lex-Yacc is a common (not only Python) practice for this, so ready-made grammars for a simple arithmetic tokenizer may already exist (like this one), and you just have to fit them to your specific needs.
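As a rough illustration, a minimal PLY lexer might look like this (the token rules are my own assumptions; the NUMBER rule keeps exponents such as 4e1 together, which matches the "ideal world" behaviour asked for above):
import ply.lex as lex

tokens = ('NAME', 'NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE')

t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
# Single letters are names; digit runs, optionally with a fraction and an
# exponent (e.g. 4e1), form one NUMBER token.
t_NAME = r'[a-zA-Z]'
t_NUMBER = r'\d+(\.\d+)?([eE]\d+)?'
t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('x+13.5*10x-4e1')
print([tok.value for tok in lexer])
# ['x', '+', '13.5', '*', '10', 'x', '-', '4e1']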
