Split string with regex by new lines, symbols and whitespaces in python - python

I'm new to the regex library, and I'm trying to turn a text like this
"""constructor SquareGame new(){
let square=square;
}"""
into a list like this:
['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '}']
I need to create a list of tokens separated by white spaces, new lines and these symbols: {}()[].;,+-*/&|<>=~.
I used re.findall('[,;.()={}]+|\S+|\n', text), but it seems to separate tokens only by white spaces and new lines.

You may use
re.findall(r'\w+|[^\w \t]', text)
To avoid matching any Unicode horizontal whitespace, use
re.findall(r'\w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]', text)
See the regex demo. Details:
\w+ - 1 or more word chars
| - or
[^\w \t] - a single char that is neither a word char, a space nor a tab (so all vertical whitespace, such as newlines, is matched).
You may add more horizontal whitespace chars to exclude to the [^\w \t] character class; see their list at Match whitespace but not newlines. The regex will then look like \w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000].
See the Python demo:
import re
pattern = r"\w+|[^\w \t]"
text = "constructor SquareGame new(){\nlet square=square;\n}"
print(re.findall(pattern, text))
# => ['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']
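And a quick demo of the variant with the extended character class, assuming the input may also contain a non-breaking space (this extra example is not from the original answer):
import re

# Also exclude common Unicode horizontal whitespace from the negated class.
pattern = r"\w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]"
text = "constructor SquareGame\u00A0new(){\nlet square=square;\n}"  # \u00A0 is a no-break space
print(re.findall(pattern, text))
# => ['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']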

This regex matches only the characters that you listed, which I think is a safer approach.
>>> re.findall(r"\w+|[{}()\[\].;,+\-*/&|<>=~\n]", text)
['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']
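If you'd rather build that character class straight from the symbol list given in the question, one way (my own sketch, not part of the original answer) is to escape the symbols with re.escape:
import re

symbols = "{}()[].;,+-*/&|<>=~"
# Word characters, or any single symbol from the question's list, or a newline token.
pattern = r"\w+|[" + re.escape(symbols) + r"\n]"
text = "constructor SquareGame new(){\nlet square=square;\n}"
print(re.findall(pattern, text))
# => ['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']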

Related

Error when creating a simple custom dynamic tokenizer in Python

I am trying to create a dynamic tokenizer, but it does not work as intended.
Below is my code:
import re

def tokenize(sent):
    splitter = re.findall("\W", sent)
    splitter = list(set(splitter))
    for i in sent:
        if i in splitter:
            sent.replace(i, "<SPLIT>" + i + "<SPLIT>")
    sent.split('<SPLIT>')
    return sent
sent = "Who's kid are you? my ph. is +1-6466461022.Bye!"
tokens = tokenize(sent)
print(tokens)
This does not work!
I expected it to return the below list:
["Who", "'s", "kid", "are", "you","?", "my" ,"ph",".", "is", "+","1","-",6466461022,".","Bye","!"]
This would be pretty trivial if it weren't for the special treatment of the '. I'm assuming you're doing NLP, so you want to take into account which "side" the ' belongs to. For instance, "tryin'" should not be split and neither should "'tis" (it is).
import re

def tokenize(sent):
    split_pattern = rf"(\w+')(?:\W+|$)|('\w+)|(?:\s+)|(\W)"
    return [word for word in re.split(split_pattern, sent) if word]

sent = (
    "Who's kid are you? my ph. is +1-6466461022.Bye!",
    "Tryin' to show how the single quote can belong to either side",
    "'tis but a regex thing + don't forget EOL testin'",
    "You've got to love regex"
)

for item in sent:
    print(tokenize(item))
The Python re module evaluates the alternatives of a pattern containing | from left to right and takes the first alternative that matches, even if it is not the longest possible match.
Furthermore, a feature of the re.split() function is that you can use match groups to retain the patterns/matches you're splitting at (otherwise the string is split and the matches where the splits happen are dropped).
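A minimal illustration of that re.split behaviour (not part of the original answer):
import re

# Without a capture group the whitespace delimiters are dropped...
print(re.split(r"\s+", "a b  c"))    # ['a', 'b', 'c']
# ...with a capture group they are kept in the result list.
print(re.split(r"(\s+)", "a b  c"))  # ['a', ' ', 'b', '  ', 'c']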
Pattern breakdown:
(\w+')(?:\W+|$) - words followed by a ' with no word characters immediately following it. E.g., "tryin'", "testin'". Don't capture the non-word characters.
('\w+) - ' followed by at least one word character. Will match "'t" and "'ve" in "don't" and "they've", respectively.
(?:\s+) - split on any whitespace, but discard the whitespace itself
(\W) - split on all non-word characters (no need to bother finding the subset that's present in the string itself)
You can use
[x for x in re.split(r"([^'\w\s]|'(?![^\W\d_])|(?<![^\W\d_])')|(?='(?<=[^\W\d_]')(?=[^\W\d_]))|\s+", sent) if x]
See the regex demo. The pattern matches
( - start of Group 1 (since these texts are captured into a group, the matches appear in the resulting list):
[^'\w\s] - any char other than ', word and whitespace char
| - or
'(?![^\W\d_]) - a ' not immediately followed with a letter ([^\W\d_] matches any Unicode letter)
| - or
(?<![^\W\d_])' - a ' not immediately preceded with a letter
) - end of the group
| - or
(?='(?<=[^\W\d_]')(?=[^\W\d_])) - a location right before a ' char that is enclosed with letters
| - or
\s+ - one or more whitespace chars.
See the Python demo:
import re
sents = ["Who's kid are you? my ph. is +1-6466461022.Bye!", "Who's kid are you? my ph. is +1-6466461022.'Bye!'"]
for sent in sents:
    print([x for x in re.split(r"([^'\w\s]|'(?![^\W\d_])|(?<![^\W\d_])')|(?='(?<=[^\W\d_]')(?=[^\W\d_]))|\s+", sent) if x])
# => ['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', 'Bye', '!']
# => ['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', "'", 'Bye', '!', "'"]

How to remove punctuation from a string [duplicate]

This question already has answers here: Best way to strip punctuation from a string (32 answers). Closed 3 years ago.
One of the projects I've been working on is a word counter, and to do that I have to effectively remove all punctuation from a string.
I tried using the split method and splitting at punctuation, but that later makes the list very weird (a single word ends up split into several list items). I then tried keeping the punctuation in a list or a string and using a for loop to eliminate it, but neither attempt was successful:
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = list["'", '"', ',', '.', '?', '!', ':', ';', '()']
for i in content_string.lower():
    if i in punctuation:
        i = i.replace[i, " "]
    else:
        i = i
It says
"TypeError: 'type' object is not subscriptable"
This message appears whether I use a string or a list.
There is a mix-up between parentheses and square brackets.
list and replace are callables; arguments are passed with parentheses.
Also, try to describe your algorithm in words, for example:
For every forbidden character, I want to remove it from my content (replace it with a space).
Here is an implementation you can start with:
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')']
for i in punctuation:
    content_string = content_string.replace(i, " ")
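Since the stated goal is a word counter, here is a short follow-up on that loop (an illustrative sketch, not part of the original answer): replacing punctuation with spaces can leave double spaces, which str.split() handles fine when counting words.
# After the loop above, every punctuation character has become a space (possibly doubled).
print(content_string)
# This  is a test  to see  whether  or not  the code can eliminate punctuation
words = content_string.split()  # split() with no argument collapses runs of whitespace
print(len(words))  # 14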
To create a list you use l = [...], not l = list[...], and functions/methods (such as str.replace) are called with parentheses, not square brackets. However, you can use re.sub to do this in a much simpler way:
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')'] # '(', ')' not `()`
import re
new_string = re.sub('|'.join(map(re.escape, punctuation)), '', content_string)
print(new_string)
Output:
This is a test to see whether or not the code can eliminate punctuation
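Since every entry in punctuation is a single character, a variant (my own sketch, not part of the original answer) is to build one character class instead of an alternation:
import re

content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')']

# One character class covering all the punctuation characters, escaped for safety.
char_class = "[" + re.escape("".join(punctuation)) + "]"
print(re.sub(char_class, "", content_string))
# This is a test to see whether or not the code can eliminate punctuation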
Your error
"TypeError: 'type' object is not subscriptable"
comes from the line
punctuation = list["'", '"', ',', '.', '?', '!', ':', ';', '()']
To define a list you either use brackets [ ] without the keyword list, or, if you use list, you have to call it with parentheses (although in this case converting a list into a list is redundant):
# both options will work, but the second one is redundant and therefore wrong
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')']
punctuation = list(["'", '"', ',', '.', '?', '!', ':', ';', '(', ')'])
Notice that the last element () must be split into two elements, ( and ).
Now, to achieve what you want in an efficient way, use a conditional list comprehension:
''.join([i if i not in punctuation else ' ' for i in content_string])
result:
'This is a test to see whether or not the code can eliminate punctuation'
Notice that, according to your code, you are not removing the punctuation symbols but replacing them with spaces.
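If you do want the symbols removed entirely rather than replaced with spaces, the same comprehension works with a filter instead (a small variant, not part of the original answer):
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')']

# Keep only the characters that are not punctuation.
print(''.join([i for i in content_string if i not in punctuation]))
# This is a test to see whether or not the code can eliminate punctuation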
There are multiple bugs in the code.
First one:
The list call is unnecessary here.
If you wanted to use it, you would need to add parentheses () so that list() is properly called on an already defined list.
BAD punctuation = list["'", '"', ',', '.', '?', '!', ':', ';', '()']
BETTER punctuation = list(["'", '"', ',', '.', '?', '!', ':', ';', '()'])
But simply defining the list with regular [] syntax would be enough, and also more efficient than a list() call.
Second one:
You will not be able to replace parentheses with the if i in punctuation: check.
This is because '()' is a two-character string, while you are iterating over single characters of your string, so you will always be comparing '(' or ')' with '()'.
A possible fix - add parentheses separately to the punctuation list as single characters.
Third bug, or rather a redundant else branch:
else:
    i = i
This serves no purpose whatsoever; you can drop the else branch entirely.
Fourth, the most apparent bug:
In your for loop you are editing the variable i, which holds a copy of a single character from the string you are iterating over. You need to perform the change on the original data; this can be done with enumerate, but only if you first turn your string into a list (strings are immutable), so that you can modify its values.
chars = list(content_string.lower())
for i, char in enumerate(chars):
    if char in punctuation:
        chars[i] = ' '
content_string = ''.join(chars)
Anyway, the goal you are trying to achieve can come down to a one-liner, using a list comprehension and a string join on the resulting list afterwards:
content_string = ''.join([char if char not in punctuation else ' ' for char in content_string.lower()])

Python - Read Tokens from a String

I am reading all tokens (operators, int, str, etc.) from a string
My current code:
import re
expression = "($mat.name == 'sign') AND ($most == 100.0)"
tokens = re.findall("\$+[a-zA-Z]*\.[a-zA-Z]+|[a-zA-Z]+|Not|not|NOT|[=]+|[+/*()-]|[0-9]*\.[0-9]+|[0-9]+", expression)
The current result:
['(', '$mat.name', '==', 'sign', ')', 'AND', '(', 'most', '==', '100.0', ')']
The problem is that, while the regex is correctly matching $mat.name, it matches most instead of $most.
Can you please help me fix the regular expression?
Brief
I'm not exactly sure what you're trying to accomplish, but you're matching most instead of $most because $most doesn't contain a dot: your expression says match either \$+[a-zA-Z]*\.[a-zA-Z]+ or [a-zA-Z]+, and since $most has no ., the first alternative fails and the next one matches only most.
Code
See regex in use here
\$*(?:[a-z]*\.)?[a-z]+|not|[+/*()-]|\d*\.\d+|[\d=]+
Note: The above regex simplifies the original regex and is to be used with the i flag (ignore case)
Usage
See code in use here
import re
expression = "($mat.name == 'sign') AND ($most == 100.0)"
tokens = re.findall(r"\$*(?:[a-z]*\.)?[a-z]+|not|[+/*()-]|\d*\.\d+|[\d=]+", expression, re.I)
print(tokens)
Results
Input
($mat.name == 'sign') AND ($most == 100.0)
Output
['(', '$mat.name', '==', 'sign', ')', 'AND', '(', '$most', '==', '100.0', ')']
Explanation
I made some other changes to your pattern, so I'll explain the whole thing.
Match any of the following
\$*(?:[a-z]*\.)?[a-z]+ Match the following
\$* Match $ literally any number of times
(?:[a-z]*\.)? Match the following zero or one time
[a-z]* Match any number of lowercase ASCII letters
\. Match a literal dot character .
[a-z]+ Match one or more lowercase ASCII letters
not Match this literally
[+/*()-] Match any character in the set
\d*\.\d+ Match the following
\d* Match any number of digits
\. Match a literal dot character .
\d+ Match one or more digits
[\d=]+ Match any character in the set (digits or =) one or more times
You should use the following modified regex instead.
import re
expression = "($mat.name == 'sign') AND ($most == 100.0)"
tokens = re.findall(r"\$+[a-zA-Z]*\.[a-zA-Z]+|\$+[a-zA-Z]*|[a-zA-Z]+|Not|not|NOT|[=]+|[+/*()-]|[0-9]*\.[0-9]+|[0-9]+", expression)
print(tokens)
Which will print
['(', '$mat.name', '==', 'sign', ')', 'AND', '(', '$most', '==', '100.0', ')']
See a full explanation of the modified regular expression here.
In the first alternative you have to make the . optional, as a . isn't mandatory in every variable:
>>> re.findall(r"\$+[a-zA-Z]*\.?[a-zA-Z]+|[a-zA-Z]+|Not|not|NOT|[=]+|[+/*()-]|[0-9]*\.[0-9]+|[0-9]+", expression)
['(', '$mat.name', '==', 'sign', ')', 'AND', '(', '$most', '==', '100.0', ')']
A similar solution, simplifying the regex in question:
>>> re.findall(r'\$+[A-Z.a-z]+|[a-zA-Z]+|Not|not|NOT|=+|[+/*()-]|[0-9.]+', expression)
['(', '$mat.name', '==', 'sign', ')', 'AND', '(', '$most', '==', '100.0', ')']

Python regular expression split with \W

In the Python documentation, I came across the following code snippet:
>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
What confuses me is that \W matches any character which is not a Unicode word character, but ',' is a Unicode character. Also, what do the parentheses mean? I know they define a group, but there is only one group in the pattern. Why is ', ' also returned?
A "Unicode word character" is a character that can be part of a word: basically a letter, a digit or an underscore. \W matches any character that is not one of those.
A comma cannot be part of a word, so \W matches it.
The comma is included in the resulting list because the split regex is wrapped in parentheses (defining a group inside the split regex), and re.split keeps captured groups in the result. That's the difference between your two code snippets.
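A quick check of both points (a minimal sketch, not from the original answer):
import re

# ',' is not a word character, so \W matches it...
print(re.findall(r'\W+', 'Words, words.'))  # [', ', '.']
# ...and wrapping the split pattern in a group keeps those separators in the output.
print(re.split(r'(\W+)', 'Words, words.'))  # ['Words', ', ', 'words', '.', '']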

Partitioning Multiple special characters in python

I am trying to write a program which reads a paragraph and counts the special characters and words.
My input:
list_1 = []
list_words = "'He came,"
words = list_words.partition("'")
for i in words:
    list_1.extend(i.split())
print(list_1)
My output looks like this:
["'", 'He', 'came,']
but I want
["'", 'He', 'came', ',']
Can anyone help me figure out how to do this?
I am trying to write a program which reads a paragraph and counts the special characters and words
Let's focus on the goal then, rather than your approach. Your approach is probably possible, but it would take a bunch of splits, so let's ignore it for now. Using re.findall with a suitable regex should work much better.
lst = re.findall(r"\w+|[^\w\s]", some_sentence)
would make sense. Broken down:
pat = re.compile(r"""
    \w+      # one or more word characters
    |        # OR
    [^\w\s]  # exactly one character that's neither a word character nor whitespace
    """, re.X)
results = pat.findall('"Why, hello there, Martha!"')
# ['"', 'Why', ',', 'hello', 'there', ',', 'Martha', '!', '"']
However, then you have to go through another iteration of your list to count the special characters! Let's separate them instead. Luckily this is easy -- just add capturing parentheses.
new_pat = re.compile(r"""
    (        # begin capture group
    \w+      # one or more word characters
    )        # end capture group
    |        # OR
    (        # begin capture group
    [^\w\s]  # exactly one character that's neither a word character nor whitespace
    )        # end capture group
    """, re.X)
results = new_pat.findall('"Why, hello there, Martha!"')
# [('', '"'), ('Why', ''), ('', ','), ('hello', ''), ('there', ''), ('', ','), ('Martha', ''), ('', '!'), ('', '"')]
grouped_results = {"words": [], "punctuations": []}
for word, punctuation in results:
    if word:
        grouped_results['words'].append(word)
    if punctuation:
        grouped_results['punctuations'].append(punctuation)
# grouped_results = {'punctuations': ['"', ',', ',', '!', '"'],
#                    'words': ['Why', 'hello', 'there', 'Martha']}
Then just count the items under each key.
>>> for key in grouped_results:
...     print("There are {} items in {}".format(len(grouped_results[key]), key))
...
There are 5 items in punctuations
There are 4 items in words
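If you only need the counts, you can also take them straight from the findall results (a sketch; results is the list of (word, punctuation) tuples built above):
# Count tuples where the word group matched, and where the punctuation group matched.
word_count = sum(1 for word, punct in results if word)
punct_count = sum(1 for word, punct in results if punct)
print(word_count, punct_count)  # 4 5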
