How to split a sentence into words with a regular expression? - python

"She's so nice!" -> ["she","'","s","so","nice","!"]
I want to split a sentence like this!
So I wrote the code below, but it includes whitespace! How can I do this using only a regular expression?
words = re.findall(r'\W+|\w+', sentence)
# -> ["She", "'", "s", " ", "so", " ", "nice", "!"]
words = [word for word in words if not word.isspace()]

Regex: [A-Za-z]+|[^A-Za-z ]
In [^A-Za-z ] add chars you don't want to match.
Details:
[] Match a single character present in the list
[^] Match a single character NOT present in the list
+ Matches one or more times
| Or
Python code:
import re

text = "She's so nice!"
matches = re.findall(r'[A-Za-z]+|[^A-Za-z ]', text)
print(matches)
Output:
['She', "'", 's', 'so', 'nice', '!']
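The pattern above treats every non-letter as a single character, so digits would come out one at a time. If the input can also contain numbers or tabs, a slightly widened variant (my own tweak, not part of the answer above) keeps digit runs together and skips all whitespace:

```python
import re

text = "She's so nice! Call 555 now\t"
# Letter runs, or digit runs, or any single char that is neither
# a letter, a digit, nor whitespace.
print(re.findall(r"[A-Za-z]+|\d+|[^A-Za-z\d\s]", text))
# -> ['She', "'", 's', 'so', 'nice', '!', 'Call', '555', 'now']
```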

Before Python 3.7, the re module didn't allow you to split on zero-width assertions. You can use the PyPI regex package instead (ensuring you specify VERSION1 semantics, which properly handle zero-width matches).
import regex
s = "She's so nice!"
x = regex.split(r"\s+|\b(?!^|$)", s, flags=regex.VERSION1)
print(x)
Output: ['She', "'", 's', 'so', 'nice', '!']
\s+|\b(?!^|$) Match either of the following options
\s+ Match one or more whitespace characters
\b(?!^|$) Assert position as a word boundary, but not at the beginning or end of the line
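For completeness: since Python 3.7, the standard re.split can also split on zero-width matches, so a similar result is possible without the third-party package. A minimal sketch (note that, unlike regex with VERSION1, empty strings can appear where a whitespace run ends at a word boundary, so they are filtered out):

```python
import re  # requires Python 3.7+ for zero-width splits

s = "She's so nice!"
# Split on whitespace runs, or on word boundaries that are not at the
# start or end of the string; drop any empty pieces.
parts = [p for p in re.split(r"\s+|\b(?!^|$)", s) if p]
print(parts)  # -> ['She', "'", 's', 'so', 'nice', '!']
```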

Related

Error when creating a simple custom dynamic tokenizer in Python

I am trying to create a dynamic tokenizer, but it does not work as intended.
Below is my code:
import re

def tokenize(sent):
    splitter = re.findall("\W", sent)
    splitter = list(set(splitter))
    for i in sent:
        if i in splitter:
            sent.replace(i, "<SPLIT>" + i + "<SPLIT>")
    sent.split('<SPLIT>')
    return sent

sent = "Who's kid are you? my ph. is +1-6466461022.Bye!"
tokens = tokenize(sent)
print(tokens)
This does not work!
I expected it to return the below list:
["Who", "'s", "kid", "are", "you","?", "my" ,"ph",".", "is", "+","1","-",6466461022,".","Bye","!"]
This would be pretty trivial if it weren't for the special treatment of the '. I'm assuming you're doing NLP, so you want to take into account which "side" the ' belongs to. For instance, "tryin'" should not be split and neither should "'tis" (it is).
import re

def tokenize(sent):
    split_pattern = r"(\w+')(?:\W+|$)|('\w+)|(?:\s+)|(\W)"
    return [word for word in re.split(split_pattern, sent) if word]

sents = (
    "Who's kid are you? my ph. is +1-6466461022.Bye!",
    "Tryin' to show how the single quote can belong to either side",
    "'tis but a regex thing + don't forget EOL testin'",
    "You've got to love regex",
)
for item in sents:
    print(tokenize(item))
The Python re lib evaluates alternatives in a pattern containing | from left to right and uses the first alternative that matches, even if a later alternative would produce a longer match.
Furthermore, a feature of the re.split() function is that capturing groups retain the matched delimiters in the result (otherwise the string is split and the delimiters where the splits happen are dropped).
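As a quick illustration of that re.split() behavior, compare the same split with and without a capturing group:

```python
import re

# Without a capturing group the delimiters are dropped ...
print(re.split(r",", "a,b,c"))    # -> ['a', 'b', 'c']
# ... with a capturing group they are kept in the result.
print(re.split(r"(,)", "a,b,c"))  # -> ['a', ',', 'b', ',', 'c']
```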
Pattern breakdown:
(\w+')(?:\W+|$) - words followed by a ' with no word characters immediately following it. E.g., "tryin'", "testin'". Don't capture the non-word characters.
('\w+) - ' followed by at least one word character. Will match "'t" and "'ve" in "don't" and "they've", respectively.
(?:\s+) - split on any whitespace, but discard the whitespace itself
(\W) - split on all non-word characters (no need to bother finding the subset that's present in the string itself)
You can use
[x for x in re.split(r"([^'\w\s]|'(?![^\W\d_])|(?<![^\W\d_])')|(?='(?<=[^\W\d_]')(?=[^\W\d_]))|\s+", sent) if x]
The pattern matches:
( - Group 1 (as these texts are captured into a group these matches appear in the resulting list):
[^'\w\s] - any char other than ', word and whitespace char
| - or
'(?![^\W\d_]) - a ' not immediately followed with a letter ([^\W\d_] matches any Unicode letter)
| - or
(?<![^\W\d_])' - a ' not immediately preceded with a letter
) - end of the group
| - or
(?='(?<=[^\W\d_]')(?=[^\W\d_])) - a location right before a ' char that is enclosed with letters
| - or
\s+ - one or more whitespace chars.
See the Python demo:
import re
sents = ["Who's kid are you? my ph. is +1-6466461022.Bye!", "Who's kid are you? my ph. is +1-6466461022.'Bye!'"]
for sent in sents:
    print([x for x in re.split(r"([^'\w\s]|'(?![^\W\d_])|(?<![^\W\d_])')|(?='(?<=[^\W\d_]')(?=[^\W\d_]))|\s+", sent) if x])
# => ['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', 'Bye', '!']
# => ['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', "'", 'Bye', '!', "'"]

How to remove every special character except for the hyphen and the apostrophe inside and between words?

As an example, I already managed to break the sentence
"That's a- tasty tic-tac. Or -not?" into an array of words like this:
words = ["That's", 'a-', 'tasty', 'tic-tac.', 'Or', '-not?']
Now I have to remove every special character I don't need and get this:
words = ["That's", 'a', 'tasty', 'tic-tac', 'Or', 'not']
My current code looks like this:
pattern = re.compile(r'[\W_]+')
for x in range(len(file_text)):
    for y in range(len(file_text[x])):
        word_list.append(pattern.sub('', file_text[x][y]))
I have a whole text that I first turn into lines, then into words.
You can use
r"\b([-'])\b|[\W_]"
Regex details
\b([-'])\b - a - or ' that are enclosed with word chars (letters, digits or underscores) (NOTE you may require to only exclude matching these symbols when enclosed with letters if you use (?<=[^\W\d_])([-'])(?=[^\W\d_]))
| - or
[\W_] - any char other than a letter or a digit.
See the Python demo:
import re
words = ["That's", 'a-', 'tasty', 'tic-tac.','Or', '-not?']
rx = re.compile(r"\b([-'])\b|[\W_]")
print( [rx.sub(r'\1', x) for x in words] )
# => ["That's", 'a', 'tasty', 'tic-tac', 'Or', 'not']

Match words that don't start with a certain letter using regex

I am learning regex but have not been able to find the right regex in Python for selecting words that don't start with a particular letter.
Example below
text='this is a test'
match=re.findall('(?!t)\w*',text)
# match returns
['his', '', 'is', '', 'a', '', 'est', '']
match=re.findall('[^t]\w+',text)
# match
['his', ' is', ' a', ' test']
Expected : ['is','a']
With regex
Use the negative set [^\Wt] to match any alphanumeric character that is not t. To avoid matching subsets of words, add the word boundary metacharacter, \b, at the beginning of your pattern.
Also, do not forget that you should use raw strings for regex patterns.
import re
text = 'this is a test'
match = re.findall(r'\b[^\Wt]\w*', text)
print(match) # prints: ['is', 'a']
Without regex
Note that this is also achievable without regex.
text = 'this is a test'
match = [word for word in text.split() if not word.startswith('t')]
print(match) # prints: ['is', 'a']
You are almost on the right track. You just forgot \b (word boundary) token:
\b(?!t)\w+
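A quick sketch of that pattern in Python (the (?!t) lookahead sits right after the word boundary):

```python
import re

text = 'this is a test'
# \b anchors at the start of each word; (?!t) rejects words beginning with "t".
print(re.findall(r'\b(?!t)\w+', text))  # -> ['is', 'a']
```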

Regex parsing text and get relevant words / characters

I want to parse a file, that contains some programming language. I want to get a list of all symbols etc.
I tried a few patterns and decided that this is the most successful yet:
pattern = r"\b(\w+|\W+)\b"
Using this on my text, that is something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
will result in my required output, but I have some chars that I don't want and some unwanted formatting:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of and some double symbols, like (., that I would like to have as single chars. Of course I have to modify the \W+ to get this done, but I need a little help.
The other issue is that my regex doesn't match the ending ];, which I also need.
Why use \W+ for one or more, if you want single non-word characters in output? Additionally exclude whitespace by use of a negated class. Also it seems like you could drop the word boundaries.
re.findall(r"\w+|[^\w\s]", string)
This matches
\w+ one or more word characters
|[^\w\s] or one character, that is neither a word character nor a whitespace
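Run against the example string from the question, that pattern splits every symbol out individually, including the trailing ]; (a quick check):

```python
import re

string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
# Word-character runs, or single non-word non-whitespace characters.
tokens = re.findall(r"\w+|[^\w\s]", string)
print(tokens)
# -> ['the', 'quick', 'brown', '(', 'fox', ')', '.', 'jumps', '(',
#     'over', '+', 'the', ')', '=', 'lazy', '[', 'dog', ']', ';']
```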

python regular expressions -- split on non-word char or consecutive dashes, but not on single dash

I wish to split a sentence into a list of words on non-word characters (excluding dash, which likely means a hyphen) and consecutive dashes. What I mean is: "merry-go-round" is one word, not three words; "condition--but" is two words: remove the consecutive dashes.
I tried the following and it doesn't work:
listofwords = [word for word in re.split('[^a-zA-Z0-9]|-{2,}',sentence)]
I can provide a sample sentence:
sentence = 'sample sentence---such as well-being {\t'
and the desired result is ['sample', 'sentence', 'such', 'as', 'well-being'].
You can use this regex:
\w+(?:-\w+)*
Code:
import re

p = re.compile(r'\w+(?:-\w+)*')
test_str = "sample sentence---such as well-being { "
print(re.findall(p, test_str))
Output:
['sample', 'sentence', 'such', 'as', 'well-being']
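For what it's worth, the original re.split approach also works once the character class stops eating single hyphens (my own adjustment, not from the answer above): exclude - from the class so only runs of two or more dashes, or other non-word runs, split, then drop the empty pieces:

```python
import re

sentence = 'sample sentence---such as well-being {\t'
# Split on runs of characters that are neither alphanumeric nor "-",
# or on two-or-more consecutive dashes; filter out empty strings.
words = [w for w in re.split(r'[^a-zA-Z0-9-]+|-{2,}', sentence) if w]
print(words)  # -> ['sample', 'sentence', 'such', 'as', 'well-being']
```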
