Error when creating a simple custom dynamic tokenizer in Python

I am trying to create a dynamic tokenizer, but it does not work as intended.
Below is my code:
import re

def tokenize(sent):
    splitter = re.findall(r"\W", sent)
    splitter = list(set(splitter))
    for i in sent:
        if i in splitter:
            sent.replace(i, "<SPLIT>" + i + "<SPLIT>")
    sent.split('<SPLIT>')
    return sent
sent = "Who's kid are you? my ph. is +1-6466461022.Bye!"
tokens = tokenize(sent)
print(tokens)
This does not work!
I expected it to return the below list:
["Who", "'s", "kid", "are", "you","?", "my" ,"ph",".", "is", "+","1","-",6466461022,".","Bye","!"]

This would be pretty trivial if it weren't for the special treatment of the '. I'm assuming you're doing NLP, so you want to take into account which "side" the ' belongs to. For instance, "tryin'" should not be split, and neither should "'tis" (short for "it is").
import re

def tokenize(sent):
    split_pattern = r"(\w+')(?:\W+|$)|('\w+)|(?:\s+)|(\W)"
    return [word for word in re.split(split_pattern, sent) if word]
sents = (
    "Who's kid are you? my ph. is +1-6466461022.Bye!",
    "Tryin' to show how the single quote can belong to either side",
    "'tis but a regex thing + don't forget EOL testin'",
    "You've got to love regex",
)
for item in sents:
    print(tokenize(item))
Python's re module evaluates the alternatives in a pattern containing | from left to right and stops at the first alternative that matches, even if a later alternative would have produced a longer match.
Furthermore, a feature of the re.split() function is that you can use capture groups to retain the patterns/matches you're splitting on (otherwise the string is split and the matches where the splits happen are dropped).
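A minimal illustration of both behaviors (toy patterns, not the tokenizer itself):

import re

# Alternation is first-match-wins: "foo" is tried before "foobar",
# so the longer alternative never gets a chance.
print(re.findall(r"foo|foobar", "foobar"))  # ['foo']

# A capture group in re.split() keeps the delimiters in the result.
print(re.split(r"(,)", "a,b,c"))  # ['a', ',', 'b', ',', 'c']
print(re.split(r",", "a,b,c"))    # ['a', 'b', 'c']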
Pattern breakdown:
(\w+')(?:\W+|$) - words followed by a ' with no word characters immediately following it. E.g., "tryin'", "testin'". Don't capture the non-word characters.
('\w+) - ' followed by at least one word character. Will match "'t" and "'ve" in "don't" and "they've", respectively.
(?:\s+) - split on any whitespace, but discard the whitespace itself
(\W) - split on all non-word characters (no need to bother finding the subset that's present in the string itself)
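For reference, the loop above should print along these lines (worked out by hand against the pattern, so worth re-running to confirm):

['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', 'Bye', '!']
["Tryin'", 'to', 'show', 'how', 'the', 'single', 'quote', 'can', 'belong', 'to', 'either', 'side']
["'tis", 'but', 'a', 'regex', 'thing', '+', 'don', "'t", 'forget', 'EOL', "testin'"]
['You', "'ve", 'got', 'to', 'love', 'regex']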

You can use
[x for x in re.split(r"([^'\w\s]|'(?![^\W\d_])|(?<![^\W\d_])')|(?='(?<=[^\W\d_]')(?=[^\W\d_]))|\s+", sent) if x]
The pattern matches:
( - start of Group 1 (because these texts are captured into a group, the matches appear in the resulting list):
[^'\w\s] - any char other than ', word and whitespace char
| - or
'(?![^\W\d_]) - a ' not immediately followed with a letter ([^\W\d_] matches any Unicode letter)
| - or
(?<![^\W\d_])' - a ' not immediately preceded with a letter
) - end of the group
| - or
(?='(?<=[^\W\d_]')(?=[^\W\d_])) - a location right before a ' char that is enclosed with letters
| - or
\s+ - one or more whitespace chars.
See the Python demo:
import re

sents = ["Who's kid are you? my ph. is +1-6466461022.Bye!", "Who's kid are you? my ph. is +1-6466461022.'Bye!'"]
for sent in sents:
    print([x for x in re.split(r"([^'\w\s]|'(?![^\W\d_])|(?<![^\W\d_])')|(?='(?<=[^\W\d_]')(?=[^\W\d_]))|\s+", sent) if x])
# => ['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', 'Bye', '!']
# => ['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', "'", 'Bye', '!', "'"]

Related

How to remove every special character except for the hyphen and the apostrophe inside and between words?

As an example, I already managed to break the sentence
"That's a- tasty tic-tac. Or -not?" into an array of words like this:
words = ["That's", 'a-', 'tasty', 'tic-tac.', 'Or', '-not?']
Now I have to remove every special character I don't need and get this: words = ["That's", 'a', 'tasty', 'tic-tac', 'Or', 'not']
My actual current code looks like this:
pattern = re.compile(r'[\W_]+')
for x in range(0, len(file_text)):
    for y in range(0, len(file_text[x])):
        word_list.append(pattern.sub('', file_text[x][y]))
I have a whole text that I first turn into lines and words, and then into just words.
You can use
r"\b([-'])\b|[\W_]"
Regex details
\b([-'])\b - a - or ' that is enclosed in word chars (letters, digits or underscores). (Note: if you only want to keep these symbols when they are enclosed in letters, use (?<=[^\W\d_])([-'])(?=[^\W\d_]) instead.)
| - or
[\W_] - any char other than a letter or a digit.
See the Python demo:
import re

words = ["That's", 'a-', 'tasty', 'tic-tac.', 'Or', '-not?']
rx = re.compile(r"\b([-'])\b|[\W_]")
print([rx.sub(r'\1', x) for x in words])
# => ["That's", 'a', 'tasty', 'tic-tac', 'Or', 'not']

How to split a sentence into words with a regular expression?

"She's so nice!" -> ["she","'","s","so","nice","!"]
I want to split a sentence like this!
So I wrote the code below, but it includes whitespace!
How can I do this using only a regular expression?
words = re.findall(r'\W+|\w+', sentence)
-> ["She", "'", "s", " ", "so", " ", "nice", "!"]
words = [word for word in words if not word.isspace()]
Regex: [A-Za-z]+|[^A-Za-z ]
In [^A-Za-z ], add the chars you don't want to match; the space is already there, which is what keeps whitespace out of the results.
Details:
[] Match a single character present in the list
[^] Match a single character NOT present in the list
+ Matches between one and unlimited times
| Or
Python code:
text = "She's so nice!"
matches = re.findall(r'[A-Za-z]+|[^A-Za-z ]', text)
Output:
['She', "'", 's', 'so', 'nice', '!']
Before Python 3.7, re.split() didn't allow splitting on zero-width matches. You can use Python's PyPI regex package instead (ensuring you specify version 1 behaviour, which properly handles zero-width matches).
import regex
s = "She's so nice!"
x = regex.split(r"\s+|\b(?!^|$)", s, flags=regex.VERSION1)
print(x)
Output: ['She', "'", 's', 'so', 'nice', '!']
\s+|\b(?!^|$) Match either of the following options
\s+ Match one or more whitespace characters
\b(?!^|$) Assert position as a word boundary, but not at the beginning or end of the line
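On Python 3.7+ the standard library's re.split also accepts zero-width matches, so the same idea should work without the extra dependency:

import re

s = "She's so nice!"
print(re.split(r"\s+|\b(?!^|$)", s))
# ['She', "'", 's', 'so', 'nice', '!']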

What's the difference between ([])+ and []+?

>>> import re
>>> sentence = "Thomas Jefferson began building Monticello at the age of 26."
>>> tokens1 = re.split(r"([-\s.,;!?])+", sentence)
>>> tokens2 = re.split(r"[-\s.,;!?]+", sentence)
>>> tokens1
['Thomas', ' ', 'Jefferson', ' ', 'began', ' ', 'building', ' ', 'Monticello', ' ', 'at', ' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']
>>> tokens2
['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '']
Can you explain the purpose of ( and )?
(..) in a regex denotes a capturing group (aka "capturing parentheses"). They are used when you want to extract values out of a pattern. In this case, you are using the re.split function, which behaves in a specific way when the pattern has capturing groups. According to the documentation:
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the occurrences of pattern. If capturing parentheses
are used in pattern, then the text of all groups in the pattern are
also returned as part of the resulting list.
So normally, the delimiters used to split the string are not present in the result, like in your second example. However, if you use (), the text captured in the groups will also be in the result of the split. This is why you get a lot of ' ' in the first example. That is what is captured by your group ([-\s.,;!?]).
With a capturing group (()) in the regex used to split a string, split will include the captured parts.
In your case, you are splitting on one or more characters of whitespace and/or punctuation, and capturing the last of those characters to include in the split parts, which seems like a weird thing to do. I'd have expected you might want to capture all of the separator, which would look like r"([-\s.,;!?]+)" (capturing one or more whitespace/punctuation characters, rather than matching one or more but only capturing the last).
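A quick sketch of the difference, using a made-up input with a multi-character separator (comma followed by a space) so the two placements of + actually diverge:

import re

s = "a, b"
# ([...])+ repeats the group; only the last repetition survives.
print(re.split(r"([-\s.,;!?])+", s))  # ['a', ' ', 'b']
# ([...]+) captures the whole separator run.
print(re.split(r"([-\s.,;!?]+)", s))  # ['a', ', ', 'b']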

Regex parsing text and get relevant words / characters

I want to parse a file that contains some programming language. I want to get a list of all symbols etc.
I tried a few patterns and decided that this is the most successful yet:
pattern = "\b(\w+|\W+)\b"
Using this on my text, that is something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
will result in my required output, but I have some chars that I don't want and some unwanted formatting:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of, and some double symbols, like (., that I would like to have as single chars. Of course I have to modify the \W+ to get this done, but I need a little help.
The other issue is that my regex doesn't match the ending ];, which I also need.
Why use \W+ for one or more, if you want single non-word characters in the output? Additionally, exclude whitespace by using a negated class. It also seems like you can drop the word boundaries.
re.findall(r"\w+|[^\w\s]", string)
This matches
\w+ one or more word characters
|[^\w\s] or one character, that is neither a word character nor a whitespace
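Run against the sample string from the question, this should print (output written out by hand, so double-check by running it):

import re

string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
print(re.findall(r"\w+|[^\w\s]", string))
# ['the', 'quick', 'brown', '(', 'fox', ')', '.', 'jumps', '(',
#  'over', '+', 'the', ')', '=', 'lazy', '[', 'dog', ']', ';']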

Partitioning Multiple special characters in python

I am trying to write a program which reads a paragraph and counts the special characters and words.
My input:
list words ="'He came,"
words = list words. partition("'")
for i in words:
list-1. extend(i.split())
print(list-1)
my output looks like this:
["'", 'He', 'came,']
but I want
["'", 'He', 'came', ',']
Can anyone help me with how to do this?
I am trying to write a program which reads a paragraph and counts the special characters and words.
Let's focus on the goal then, rather than your approach. Your approach is probably possible, but it may take a bunch of splits, so let's just ignore it for now. Using re.findall and a filtered regex should work much better.
lst = re.findall(r"\w+|[^\w\s]", some_sentence)
That would make sense. Broken down, it does the following:
import re

pat = re.compile(r"""
    \w+     # one or more word characters
    |       # OR
    [^\w\s] # exactly one character that's neither a word character nor whitespace
    """, re.X)
results = pat.findall('"Why, hello there, Martha!"')
# ['"', 'Why', ',', 'hello', 'there', ',', 'Martha', '!', '"']
However, then you have to go through another iteration of your list to count the special characters! Let's separate them instead. Luckily this is easy: just add capturing parentheses.
new_pat = re.compile(r"""
    (        # begin capture group
    \w+      # one or more word characters
    )        # end capture group
    |        # OR
    (        # begin capture group
    [^\w\s]  # exactly one character that's neither a word character nor whitespace
    )        # end capture group
    """, re.X)
results = new_pat.findall('"Why, hello there, Martha!"')
# [('', '"'), ('Why', ''), ('', ','), ('hello', ''), ('there', ''), ('', ','), ('Martha', ''), ('', '!'), ('', '"')]
grouped_results = {"words": [], "punctuations": []}
for word, punctuation in results:
    if word:
        grouped_results['words'].append(word)
    if punctuation:
        grouped_results['punctuations'].append(punctuation)
# grouped_results = {'words': ['Why', 'hello', 'there', 'Martha'],
#                    'punctuations': ['"', ',', ',', '!', '"']}
Then just count your dict keys.
>>> for key in grouped_results:
...     print("There are {} items in {}".format(len(grouped_results[key]), key))
There are 4 items in words
There are 5 items in punctuations
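As a side note (not part of the original answer), the same counts can be computed in one line with a dict comprehension:

counts = {key: len(values) for key, values in grouped_results.items()}
# {'words': 4, 'punctuations': 5}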
