Python - Read Tokens from a String - python

I am reading all tokens (operators, int, str, etc.) from a string
My current code:
import re
expression = "($mat.name == 'sign') AND ($most == 100.0)"
tokens = re.findall("\$+[a-zA-Z]*\.[a-zA-Z]+|[a-zA-Z]+|Not|not|NOT|[=]+|[+/*()-]|[0-9]*\.[0-9]+|[0-9]+", expression)
The current result:
['(', '$mat.name', '==', 'sign', ')', 'AND', '(', 'most', '==', '100.0', ')']
The problem is that, while the regex is correctly matching $mat.name, it matches most instead of $most.
Can you please help me fix the regular expression?

Brief
I'm not exactly sure what you're trying to accomplish, but you're matching most instead of $most because $most doesn't contain a dot. Your pattern says to match either \$+[a-zA-Z]*\.[a-zA-Z]+ or [a-zA-Z]+. The first alternative requires a ., so it fails on $most, and the engine falls through to the next alternative, which matches only most.
Code
\$*(?:[a-z]*\.)?[a-z]+|not|[+/*()-]|\d*\.\d+|[\d=]+
Note: The above regex simplifies the original regex and is to be used with the i flag (ignore case)
Usage
import re
expression = "($mat.name == 'sign') AND ($most == 100.0)"
tokens = re.findall(r"\$*(?:[a-z]*\.)?[a-z]+|not|[+/*()-]|\d*\.\d+|[\d=]+", expression, re.I)
print(tokens)
Results
Input
($mat.name == 'sign') AND ($most == 100.0)
Output
['(', '$mat.name', '==', 'sign', ')', 'AND', '(', '$most', '==', '100.0', ')']
Explanation
I made some other changes to your pattern, so I'll explain the whole thing.
Match any of the following:
  \$*(?:[a-z]*\.)?[a-z]+ Match the following
    \$* Match $ literally any number of times
    (?:[a-z]*\.)? Match the following zero or one time
      [a-z]* Match any number of ASCII letters (either case, because of the i flag)
      \. Match a literal dot character .
    [a-z]+ Match one or more ASCII letters (either case, because of the i flag)
  not Match this literally
  [+/*()-] Match any character in the set
  \d*\.\d+ Match the following
    \d* Match any number of digits
    \. Match a literal dot character .
    \d+ Match one or more digits
  [\d=]+ Match any character in the set (digits or =) one or more times
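To see the difference in isolation, here is a small sketch (not part of the original answer) that tests only the variable alternative of each pattern against the two names from the question. The names dotted_only, optional_dot and results are made up for illustration.

```python
import re

# First alternative of the original pattern: the dot is mandatory.
dotted_only = r"\$+[a-zA-Z]*\.[a-zA-Z]+"
# First alternative of the revised pattern: the "prefix." part is optional.
optional_dot = r"\$*(?:[a-z]*\.)?[a-z]+"

results = {}
for name in ("$mat.name", "$most"):
    results[name] = (
        re.fullmatch(dotted_only, name) is not None,
        re.fullmatch(optional_dot, name, re.I) is not None,
    )
    print(name, results[name])
# $mat.name (True, True)
# $most (False, True)
```

Since the dotted-only alternative cannot match $most as a whole, findall skips the $ and the plain-letters alternative picks up most on its own.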

You should use the following modified regex instead.
import re
expression = "($mat.name == 'sign') AND ($most == 100.0)"
tokens = re.findall(r"\$+[a-zA-Z]*\.[a-zA-Z]+|\$+[a-zA-Z]*|[a-zA-Z]+|Not|not|NOT|[=]+|[+/*()-]|[0-9]*\.[0-9]+|[0-9]+", expression)
print(tokens)
Which will print
['(', '$mat.name', '==', 'sign', ')', 'AND', '(', '$most', '==', '100.0', ')']

For the variable token, you have to check whether . is present or not, because . isn't mandatory in every variable:
>>> re.findall(r"\$+[a-zA-Z]*\.?[a-zA-Z]+|[a-zA-Z]+|Not|not|NOT|[=]+|[+/*()-]|[0-9]*\.[0-9]+|[0-9]+", expression)
['(', '$mat.name', '==', 'sign', ')', 'AND', '(', '$most', '==', '100.0', ')']
A similar solution, simplifying the regex in the question:
>>> re.findall(r'\$+[A-Z.a-z]+|[a-zA-Z]+|Not|not|NOT|=+|[+/*()-]|[0-9.]+', expression)
['(', '$mat.name', '==', 'sign', ')', 'AND', '(', '$most', '==', '100.0', ')']
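Since the question mentions reading different kinds of tokens (operators, int, str, etc.), one possible extension is to tag each token with its kind using named groups. This is only a sketch under my own assumptions: the token class names (NUMBER, VARIABLE, WORD, COMPARE, OPERATOR) and the helper tokenize are hypothetical, and the VARIABLE pattern is a slightly stricter variant than the ones in the answers above.

```python
import re

# Hypothetical token classes; the names are illustrative only.
# Order matters: earlier alternatives are tried first.
TOKEN_SPEC = [
    ("NUMBER",   r"\d*\.\d+|\d+"),
    ("VARIABLE", r"\$+[A-Za-z]+(?:\.[A-Za-z]+)?"),
    ("WORD",     r"[A-Za-z]+"),
    ("COMPARE",  r"=+"),
    ("OPERATOR", r"[+/*()-]"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(expression):
    """Return (kind, text) pairs; unmatched characters such as quotes are skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(expression)]

expression = "($mat.name == 'sign') AND ($most == 100.0)"
tokens = tokenize(expression)
for kind, value in tokens:
    print(kind, value)
```

The values come out in the same order as with findall, but each one carries the name of the alternative that matched it.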

Related

Error when creating a simple custom dynamic tokenizer in Python

I am trying to create a dynamic tokenizer, but it does not work as intended.
Below is my code:
import re
def tokenize(sent):
    splitter = re.findall("\W",sent)
    splitter = list(set(splitter))
    for i in sent:
        if i in splitter:
            sent.replace(i, "<SPLIT>"+i+"<SPLIT>")
    sent.split('<SPLIT>')
    return sent
sent = "Who's kid are you? my ph. is +1-6466461022.Bye!"
tokens = tokenize(sent)
print(tokens)
This does not work!
I expected it to return the below list:
["Who", "'s", "kid", "are", "you","?", "my" ,"ph",".", "is", "+","1","-",6466461022,".","Bye","!"]
This would be pretty trivial if it weren't for the special treatment of the '. I'm assuming you're doing NLP, so you want to take into account which "side" the ' belongs to. For instance, "tryin'" should not be split and neither should "'tis" (it is).
import re
def tokenize(sent):
    # A plain raw string is enough; the pattern has no f-string placeholders.
    split_pattern = r"(\w+')(?:\W+|$)|('\w+)|(?:\s+)|(\W)"
    # re.split yields None/'' for groups that did not take part; drop those.
    return [word for word in re.split(split_pattern, sent) if word]
sent = (
    "Who's kid are you? my ph. is +1-6466461022.Bye!",
    "Tryin' to show how the single quote can belong to either side",
    "'tis but a regex thing + don't forget EOL testin'",
    "You've got to love regex"
)
for item in sent:
    print(tokenize(item))
The Python re library tries the alternatives of a pattern containing | from left to right and stops at the first alternative that matches, even if a later alternative would produce a longer match.
Furthermore, a feature of the re.split() function is that you can use match groups to retain the patterns/matches you're splitting at (otherwise the string is split and the matches where the splits happen are dropped).
Pattern breakdown:
(\w+')(?:\W+|$) - words followed by a ' with no word characters immediately following it. E.g., "tryin'", "testin'". Don't capture the non-word characters.
('\w+) - ' followed by at least one word character. Will match "'t" and "'ve" in "don't" and "they've", respectively.
(?:\s+) - split on any whitespace, but discard the whitespace itself
(\W) - split on all non-word characters (no need to bother finding the subset that's present in the string itself)
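Both behaviours described above (ordered alternation, and capture groups keeping the separators in re.split) can be checked on toy inputs. This is a minimal illustration with made-up strings, not part of the tokenizer itself:

```python
import re

# Without a capture group, the separators are dropped.
plain = re.split(r",", "a,b,c")
# With a capture group, the separators are kept in the result.
kept = re.split(r"(,)", "a,b,c")

# Alternation is ordered: the first alternative that matches wins,
# even when a later one would match more text.
short_first = re.findall(r"a|ab", "ab")
long_first = re.findall(r"ab|a", "ab")

print(plain)        # ['a', 'b', 'c']
print(kept)         # ['a', ',', 'b', ',', 'c']
print(short_first)  # ['a']
print(long_first)   # ['ab']
```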
You can use
[x for x in re.split(r"([^'\w\s]|'(?![^\W\d_])|(?<![^\W\d_])')|(?='(?<=[^\W\d_]')(?=[^\W\d_]))|\s+", sent) if x]
See the regex demo. The pattern matches
( - Group 1 (as these texts are captured into a group these matches appear in the resulting list):
[^'\w\s] - any char other than ', word and whitespace char
| - or
'(?![^\W\d_]) - a ' not immediately followed with a letter ([^\W\d_] matches any Unicode letter)
| - or
(?<![^\W\d_])' - a ' not immediately preceded with a letter
) - end of the group
| - or
(?='(?<=[^\W\d_]')(?=[^\W\d_])) - a location right before a ' char that is enclosed with letters
| - or
\s+ - one or more whitespace chars.
See the Python demo:
import re
sents = ["Who's kid are you? my ph. is +1-6466461022.Bye!", "Who's kid are you? my ph. is +1-6466461022.'Bye!'"]
for sent in sents:
    print( [x for x in re.split(r"([^'\w\s]|'(?![^\W\d_])|(?<![^\W\d_])')|(?='(?<=[^\W\d_]')(?=[^\W\d_]))|\s+", sent) if x] )
# => ['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', 'Bye', '!']
# => ['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', "'", 'Bye', '!', "'"]

Split string with regex by new lines, symbols and whitespaces in python

I'm new to the regex library, and I'm trying to turn a text like this
"""constructor SquareGame new(){
let square=square;
}"""
into a list like this:
['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '}']
I need to create a list of tokens separated by white spaces, new lines and these symbols {}()[].;,+-*/&|<>=~.
I used re.findall('[,;.()={}]+|\S+|\n', text), but it seems to separate tokens by white spaces and new lines only.
You may use
re.findall(r'\w+|[^\w \t]', text)
To avoid matching any Unicode horizontal whitespace use
re.findall(r'\w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]', text)
See the regex demo. Details:
\w+ - 1 or more word chars
| - or
[^\w \t] - a single non-word char that is neither a space nor a tab char (so, all vertical whitespace is matched).
You may add more horizontal whitespace chars to exclude into the [^\w \t] character class, see their list at Match whitespace but not newlines. The regex will look like \w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000].
See the Python demo:
import re
pattern = r"\w+|[^\w \t]"
text = "constructor SquareGame new(){\nlet square=square;\n}"
print ( re.findall(pattern, text) )
# => ['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']
This regex will only match based on the characters that you indicated and I think this is a safer method.
>>> re.findall(r"\w+|[{}()\[\].;,+\-*/&|<>=~\n]", text)
['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']
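The practical difference between the two patterns shows up on a character that is not in the list of symbols. A small sketch with a made-up input containing @ (the variable names permissive and explicit are mine):

```python
import re

text = "let a = b @ c;"

# Any non-word character except space/tab becomes a token.
permissive = re.findall(r"\w+|[^\w \t]", text)
# Only the explicitly listed symbols (and newline) become tokens.
explicit = re.findall(r"\w+|[{}()\[\].;,+\-*/&|<>=~\n]", text)

print(permissive)  # ['let', 'a', '=', 'b', '@', 'c', ';']
print(explicit)    # ['let', 'a', '=', 'b', 'c', ';']
```

Note that the explicit pattern silently skips the unlisted character rather than reporting it, so whether it is "safer" depends on what you want to happen with unexpected input.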

How to remove punctuation from a string [duplicate]

One of the projects that I've been working on is a word counter, and to build it, I have to effectively remove all punctuation from a string.
I have tried using the split method to split at punctuation, but this makes the resulting list very weird (instead of one item per word, I get items that contain 5 words). I then tried keeping a list or a string full of punctuation and using a for loop to eliminate all punctuation, but neither approach was successful.
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = list["'", '"', ',', '.', '?', '!', ':', ';', '()']
for i in content_string.lower():
    if i in punctuation:
        i = i.replace[i," "]
    else:
        i = i
It says that
"TypeError: 'type' object is not subscriptable"
This message appears both when using a string or using a list.
Parentheses and square brackets are mixed up.
list and replace are callables, so their arguments are passed in parentheses.
Also, try to describe your algorithm in words, for example:
For every forbidden character, I want to remove it from my content (replace it with a space).
Here is an implementation you can start with:
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')']
for i in punctuation:
    content_string = content_string.replace(i, " ")
To create a list, you use l = [...], not l = list[...], and functions/methods (such as str.replace) are called with parentheses, not square brackets. However, you can use re.sub to do this in a much better and simpler way:
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')'] # '(', ')' not `()`
import re
new_string = re.sub('|'.join(map(re.escape, punctuation)), '', content_string)
print(new_string)
Output:
This is a test to see whether or not the code can eliminate punctuation
Your error
"TypeError: 'type' object is not subscriptable"
comes from the line
punctuation = list["'", '"', ',', '.', '?', '!', ':', ';', '()']
To define a list you either use brackets [ ] without list, or, if you use list, you have to call it with parentheses (although in this case converting a list into a list is redundant)
# both options work, but the second one is redundant
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')']
punctuation = list(["'", '"', ',', '.', '?', '!', ':', ';', '(', ')'])
Notice that the last element () must be split into two elements, ( and )
Now, to achieve what you want in an efficient way, use a list comprehension with a conditional expression
''.join([i if i not in punctuation else ' ' for i in content_string])
result:
'This  is a test  to see  whether  or not  the code can eliminate punctuation'
Notice that, as in your code, this does not remove the punctuation symbols but replaces them with spaces, which is why double spaces appear in the result.
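The difference between replacing and removing is easy to miss in printed output, so here is a short sketch comparing the two. The names replaced, removed and normalized are mine, and collapsing whitespace with split/join is just one possible way to tidy up afterwards.

```python
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')']

# Replace each punctuation character with a space (leaves double spaces).
replaced = ''.join(' ' if ch in punctuation else ch for ch in content_string)
# Drop each punctuation character entirely.
removed = ''.join(ch for ch in content_string if ch not in punctuation)
# Collapse any run of whitespace in the replaced version to a single space.
normalized = ' '.join(replaced.split())

print(repr(replaced))
print(repr(removed))
print(normalized == removed)  # True for this input
```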
There are multiple bugs in the code.
First one:
The list call is unnecessary (and list is a built-in type, not a keyword).
If you wanted to use it, you would need to add parentheses () so that list is actually called on the already defined list.
BAD punctuation = list["'", '"', ',', '.', '?', '!', ':', ';', '()']
BETTER punctuation = list(["'", '"', ',', '.', '?', '!', ':', ';', '()'])
But simply defining the list with regular [] syntax would be enough, and also more efficient than a list() call.
Second one:
You will not be able to replace parentheses with the if i in punctuation: check.
This is because they are a two character long string, and you are iterating over single characters of your string. So you will always compare '(' or ')' with '()'.
A possible fix - add parentheses separately to the punctuation list as single characters.
Third bug, or rather a redundant else clause:
else:
    i = i
This serves no purpose whatsoever; you should drop the else clause.
Fourth, the most apparent bug:
In your for loop you are reassigning the variable i, which is a copy of a single character from the string you are iterating over. You should change the original data instead. Since strings are immutable, this can be done with enumerate only if you first turn the string into a list, so that you can modify its items, and join it back afterwards.
chars = list(content_string.lower())  # strings are immutable, so work on a list
for i, char in enumerate(chars):
    if char in punctuation:
        chars[i] = ' '
content_string = ''.join(chars)
Anyway, the goal you are trying to achieve can come down to a one-liner, using a list comprehension and a string join on the resulting list afterwards:
content_string = ''.join([char if char not in punctuation else ' ' for char in content_string.lower()])
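Since the stated goal is a word counter, here is one way the pieces could fit together. Note that this swaps in a different technique from the answers above: str.translate with a table built by str.maketrans, followed by collections.Counter. It is only a sketch; the names table, cleaned and word_counts are mine.

```python
from collections import Counter

content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = "'\",.?!:;()"

# Map every punctuation character to a space in a single pass.
table = str.maketrans(punctuation, " " * len(punctuation))
cleaned = content_string.lower().translate(table)

# split() with no arguments collapses runs of whitespace,
# so the double spaces left by the replacement do not matter.
word_counts = Counter(cleaned.split())
print(word_counts.most_common(3))
```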

Regular expression (Reg exp). Why this works?

So, I am trying to get my head around regexp. The first query doesn't give me the result but the second one does. I am not able to make sense, why that is.
I am trying to tokenize the sentence,
from nltk.tokenize import RegexpTokenizer
text = 'The interest does not exceed 8.25%.'
pattern = r'\w+|\d+\.\d+\%|[^\w+\s]+'
tokenizer = RegexpTokenizer(pattern)
tokenizer.tokenize(text)
This gives me
['The', 'interest', 'does', 'not', 'exceed', '8', '.', '25', '%']
And I want
['The', 'interest', 'does', 'not', 'exceed', '8.25%']
I get my result with,
pattern = r'\d+\.\d+\%|\w+|[^\w+\s]+'
Why does it work with the second pattern? Shouldn't both the queries work?
The issue is that \w matches letters, digits and underscores. Since \w+ comes first in your alternation, it takes priority.
['The', 'interest', 'does', 'not', 'exceed', '8', '.',      '25', '%']
 \w+    \w+         \w+     \w+    \w+       \w+  [^\w\s]+  \w+   [^\w\s]+
The second alternative never has a chance to match, because the digits it would start with have already been consumed by the first one.
Reorder the alternatives so that the percentage comes first:
r'\d+\.\d+\%|\w+|[^\w\s]+'
Just a test with the basic re module:
import re
text = 'The interest does not exceed 8.25%.'
pattern = r'\d+\.\d+%|\w+|[^\w\s]+'
print(re.findall(pattern,text))
prints:
['The', 'interest', 'does', 'not', 'exceed', '8.25%', '.']
(note that you don't have to escape %)
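To compare both orderings side by side without NLTK, a sketch using plain re.findall (the names word_first and percent_first are mine, and I use [^\w\s]+ for the punctuation alternative in both patterns):

```python
import re

text = 'The interest does not exceed 8.25%.'

# \w+ is tried first, so it grabs "8" before the percentage rule is considered.
word_first = re.findall(r'\w+|\d+\.\d+%|[^\w\s]+', text)
# The percentage rule is tried first, so "8.25%" is matched as a whole.
percent_first = re.findall(r'\d+\.\d+%|\w+|[^\w\s]+', text)

print(word_first)     # [..., 'exceed', '8', '.', '25', '%.']
print(percent_first)  # [..., 'exceed', '8.25%', '.']
```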

Python - parsing user input using a verbose regex

I am trying to design a regex that will parse user input in the form of full sentences. I am struggling to get my expression to fully work. I know it is not well coded, but I am trying hard to learn. I am currently trying to get it to parse a percentage as one string; see below the code.
My test "sentence" = How I'm 15.5% wholesome-looking U.S.A. we RADAR () [] {} you -- are, ... you?
import re
text = input("please type somewhat coherently: ")
pattern = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\w+(?:[-']\w+)* # permit word-internal hyphens and apostrophes
|[-.(]+ # double hyphen, ellipsis, and open parenthesis
|\S\w* # any sequence of word characters
# |[\d+(\.\d+)?%] # percentages, 82%
|[][\{\}.,;"'?():-_`] # these are separate tokens
'''
parsed = re.findall(pattern, text)
print(parsed)
My output = ['How', "I'm", '15', '.', '5', '%', 'wholesome-looking', 'U.S.A.', 'we', 'RADAR', '(', ')', '[', ']', '{', '}', 'you', '--', 'are', ',', '...', 'you', '?']
I am looking to have the '15', '.', '5', '%' parsed as '15.5%'. The line that is currently commented out is what should do it, but when commented in it does absolutely nothing. I searched for resources to help, but they have not.
Thank you for your time.
If you just want to have the percentage match as a whole entity, you really should be aware that the regex engine analyzes the input string and the pattern from left to right. If you have an alternation, the leftmost alternative that matches the input string will be chosen, and the rest won't even be tested.
Thus, you need to pull the alternative \d+(?:\.\d+)?% up, and the capturing group should be turned into a non-capturing one, or findall will yield strange results:
(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?% # percentages, 82% <-- PULLED UP OVER HERE
|\w+(?:[-']\w+)* # permit word-internal hyphens and apostrophes
|[-.(]+ # double hyphen, ellipsis, and open parenthesis
|\S\w* # any sequence of word characters
|[][{}.,;"'?():_`-] # these are separate tokens
See regex demo.
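For reference, a runnable sketch of the reordered pattern applied to the test sentence from the question. The sentence is hard-coded here instead of being read with input(), which is my substitution, and the comments inside the pattern are reworded.

```python
import re

# Test sentence from the question, hard-coded instead of read via input().
text = "How I'm 15.5% wholesome-looking U.S.A. we RADAR () [] {} you -- are, ... you?"

pattern = r'''(?x)            # verbose mode: whitespace and comments are ignored
    (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
  | \d+(?:\.\d+)?%            # percentages, tried before the generic word rule
  | \w+(?:[-']\w+)*           # words with internal hyphens and apostrophes
  | [-.(]+                    # double hyphen, ellipsis, open parenthesis
  | \S\w*                     # any other non-space char plus trailing word chars
  | [][{}.,;"'?():_`-]        # separate punctuation tokens
'''
parsed = re.findall(pattern, text)
print(parsed)
```

With the percentage alternative placed before the word alternative, 15.5% comes out as a single token.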
Also, please note that I replaced [][\{\}.,;"'?():-_`] with [][{}.,;"'?():_`-]: braces do not have to be escaped, and - was forming an unintended range from a colon (decimal code 58) to an underscore (decimal 95), matching ;, <, =, >, ?, @, all the uppercase Latin letters, [, \, ] and ^.
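The effect of the accidental range is easy to verify on a few sample characters. A minimal check (the sample string and variable names are made up for illustration):

```python
import re

accidental_range = re.compile(r"[:-_]")  # range from ':' (58) to '_' (95)
literal_hyphen = re.compile(r"[:_-]")    # ':', '_' and a literal '-'

samples = "A@;-"
in_range = [ch for ch in samples if accidental_range.fullmatch(ch)]
in_literal = [ch for ch in samples if literal_hyphen.fullmatch(ch)]

print(in_range)    # ['A', '@', ';']
print(in_literal)  # ['-']
```

With the range, the hyphen itself is not matched at all (its code is 45, outside 58 to 95), while unrelated characters such as uppercase letters are.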
