Regex in Python to match words with special characters - python

I have this code
import re
str1 = "These should be counted as a single-word, b**m !?"
match_pattern = re.findall(r'\w{1,15}', str1)
print(match_pattern)
I want the output to be:
['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
The output should exclude non-words such as "!?". What other pattern should I use to match the words and achieve the desired output?

I would use word boundaries (\b) around one or more non-space characters:
match_pattern = re.findall(r'\b\S+\b', str1)
result:
['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
!? is skipped thanks to the word boundaries: \b only matches next to a word character, and !? contains none, so it is never treated as a word.
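To see what the word boundaries keep and what they drop, here is a small sketch. The second input string is my own illustrative assumption, not something from the question. Because \b needs a word character on one side, symbols at the edges of a token are trimmed and a token made only of symbols disappears.

```python
import re

# \b only occurs between a word character and a non-word character
# (or a string edge), so every match starts and ends on a letter,
# digit, or underscore.
pattern = re.compile(r"\b\S+\b")

question_tokens = pattern.findall("These should be counted as a single-word, b**m !?")
print(question_tokens)

# Hypothetical extra input: edge symbols are trimmed and "***" is dropped.
edge_tokens = pattern.findall("(hello) c++ ***")
print(edge_tokens)  # ['hello', 'c']
```

So this pattern is a good fit when symbols only matter inside a word (as in b**m), and a poor fit if tokens such as "c++" must survive intact.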

Probably you want something like [^\s.!?] instead of \w, but exactly what you want is not evident from a single example. [^...] matches a single character which is not one of those between the brackets, and \s matches whitespace characters (space, tab, newline, etc.), so add a quantifier such as + to match a whole run of allowed characters.
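A minimal sketch of the negated-class idea applied to the string from the question. One assumption on my part: the comma is added to the excluded set, since otherwise "single-word," would keep its trailing comma.

```python
import re

str1 = "These should be counted as a single-word, b**m !?"

# Assumption: "," joins ".", "!" and "?" in the excluded set.
# Every run of characters outside that set (and outside whitespace) is a token.
tokens = re.findall(r"[^\s.,!?]+", str1)
print(tokens)
```

Note that the excluded characters split tokens wherever they occur, so an input such as "e.g." would come out as two tokens with this class.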

You can also achieve a similar result without regular expressions:
string = "These should be counted as a single-word, b**m !?"
replacements = ['.', ',', '?', '!']
for replacement in replacements:
    # str.replace leaves the string unchanged when the character is absent,
    # so no membership test is needed first
    string = string.replace(replacement, "")
print(string.split())
# ['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']

Related

How do I tokenize to separate apostrophes?

I have a function that tokenizes the given text; however, the problem is that it splits "isn't", "o'brian" and similar words at the apostrophe. This is the code:
from typing import List
import re
def tokenize(text: str) -> List[str]:
    # Surround every run of non-word characters with spaces, then split on whitespace
    return re.sub(r'(\W+)', r' \1 ', text.lower()).split()
When I put in something like
"He said 'Isn't O'Brian the best?'"
it should come out as
['he', 'said', "'", "isn't", "o'brian", 'the', 'best', '?', "'"],
but instead I get
['he', 'said', "'", 'isn', "'", 't', 'o', "'", 'brian', 'the', 'best', "?'"]
I'm really lost because I've tried more than re.sub to separate the words and tried to split them, but nothing seems to work.
You need a better definition of "word", for example:
word_re = r"\b[A-Za-z]+(?:'[A-Za-z]+)?\b"
that is, a word boundary, then some letters, then optionally an apostrophe, followed by letters. Once you've got it, use findall to extract words:
words = [w.lower() for w in re.findall(word_re, text)]
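As a sketch of how that definition could be put to work on the sentence from the question: the pattern below lowercases first and adds a second branch, [^\w\s], so that standalone punctuation is kept as separate tokens. That second branch and the name TOKEN_RE are my assumptions, not part of the answer above.

```python
import re

# Branch 1: letters with an optional internal apostrophe (the "word" definition).
# Branch 2 (assumption): any single character that is neither a word
# character nor whitespace, so quotes and "?" become their own tokens.
TOKEN_RE = re.compile(r"\b[a-z]+(?:'[a-z]+)?\b|[^\w\s]")

def tokenize(text):
    """Lowercase the text and return words and punctuation marks in order."""
    return TOKEN_RE.findall(text.lower())

tokens = tokenize("He said 'Isn't O'Brian the best?'")
print(tokens)
```

Digits and underscores match neither branch, so they are silently skipped; widen the letter class if the input can contain them.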
If the apostrophes aren't important to the meaning of the word, i.e. "obrien" and "o'brien" are the same for your purposes, you could do some basic text pre-processing and remove all the apostrophes with text.replace("'", "").
If they do matter, you could swap them for a placeholder before you tokenize and swap them back afterwards, e.g. text.replace("'", "U0027") before and token.replace("U0027", "'") on each token after (U+0027 is the Unicode code point of '). The placeholder has to consist only of word characters: a literal "U+0027" would not survive, because + is a non-word character and \W+ would split on it.
Hope that helps, good luck!
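A rough sketch of the placeholder round trip, keeping the re.sub step from the question. The marker "U0027" and the lookaround that protects only apostrophes sitting between two word characters are assumptions of mine; a plain replace of every apostrophe would also glue quotation marks onto the neighbouring word.

```python
import re

PLACEHOLDER = "U0027"  # hypothetical marker; must contain only word characters

def tokenize(text):
    # Protect apostrophes that sit between two word characters.
    protected = re.sub(r"(?<=\w)'(?=\w)", PLACEHOLDER, text.lower())
    # Same splitting step as in the question.
    spaced = re.sub(r"(\W+)", r" \1 ", protected)
    # Restore the apostrophes inside each token.
    return [token.replace(PLACEHOLDER, "'") for token in spaced.split()]

tokens = tokenize("He said 'Isn't O'Brian the best?'")
print(tokens)
```

Because \W+ still groups adjacent symbols, the trailing ?' stays one token here; splitting it further needs a per-character pattern instead.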

Regex to parse queries with quoted substrings and return nested lists of individual words

I'm trying to write a regex that takes in a string of words containing quoted substrings like "green lizards" like to sit "in the sun", tokenizes it into words and quoted substrings (using either single or double quotes) separated by spaces, and then returns a list [['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']] where the list items are either single words or nested lists of words where a quoted substring was encountered.
I am new to regex, and was able to find a solution that captures the quoted parts: re.findall('"([^"]*)"', '"green lizards" like to sit "in the sun"') ... which returns: ['green lizards', 'in the sun']
But this doesn't capture the individual words, and also doesn't tokenize them (it returns a single string instead of a list of words, which requires me to split() each of them separately).
How would I make a regex that correctly returns the type of list I'm wanting? Also, I'm open to better methods/tools than regex for parsing these sorts of strings if anyone has suggestions.
Thanks!
With re.findall() function and built-in str methods:
import re
s = '"green lizards" like to sit "in the sun"'
result = [i.replace('"', "").split() if i.startswith('"') else i
for i in re.findall(r'"[^"]+"|\S+', s)]
print(result)
The output:
[['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']]
Another approach (supporting both single and double quotes):
import re
sentence = """"green lizards" like to sit "in the sun" and 'single quotes' remain alone"""
rx = re.compile(r"""(['"])(.*?)\1|\S+""")
tokens = [m.group(2).split()
if m.group(2) else m.group(0)
for m in rx.finditer(sentence)]
print(tokens)
Yielding
[['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun'], 'and', ['single', 'quotes'], 'remain', 'alone']
The idea here is:
(['"]) # capture a single or a double quote
(.*?) # 0+ characters lazily
\1 # up to the same type of quote previously captured
| # ...or...
\S+ # one or more non-whitespace characters
In the list comprehension we check which condition was met.
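The annotated breakdown above can be compiled directly with re.VERBOSE, which ignores whitespace and # comments inside the pattern. The sketch below does that; the names QUOTED_OR_WORD and parse_query are hypothetical, and testing group 1 instead of group 2 is my own variation so that an empty quoted pair yields an empty list.

```python
import re

QUOTED_OR_WORD = re.compile(
    r"""
    (['"])   # group 1: an opening single or double quote
    (.*?)    # group 2: as few characters as possible
    \1       # the same quote character that opened the substring
    |        # ...or...
    \S+      # a run of non-whitespace characters
    """,
    re.VERBOSE,
)

def parse_query(query):
    """Return single words as strings and quoted substrings as lists of words."""
    return [
        m.group(2).split() if m.group(1) else m.group(0)
        for m in QUOTED_OR_WORD.finditer(query)
    ]

result = parse_query('"green lizards" like to sit \'in the sun\'')
print(result)
```

A word with an internal apostrophe such as don't is still matched whole by the \S+ branch, because the quote branch is only tried where a token starts.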
You can use re.split and then a final str.split:
import re
s = '"green lizards" like to sit "in the sun"'
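# Raw string, so that \s reaches the regex engine instead of being treated as a string escape
new_s = [[i[1:-1].split()] if i.startswith('"') else i.split() for i in re.split(r'(?<=")\s|\s(?=")', s)]
last_result = [i for b in new_s for i in b]
print(last_result)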
Output:
[['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']]

Regex parsing text and get relevant words / characters

I want to parse a file that contains source code in some programming language. I want to get a list of all symbols etc.
I tried a few patterns and decided that this is the most successful yet:
pattern = r"\b(\w+|\W+)\b"
Using this on my text, that is something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
will result in my required output, but I have some chars that I don't want and some unwanted formatting:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', '(', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of and some double symbols, like ")." that I would like to have as single chars. Of course I have to modify the \W+ to get this done, but I need a little help.
The other problem is that my regex doesn't match the ending ];, which I also need.
Why use \W+ (one or more) if you want single non-word characters in the output? Match exactly one character instead, and exclude whitespace by using a negated class. You can also drop the word boundaries; they are the reason the trailing ]; is not matched.
re.findall(r"\w+|[^\w\s]", string)
This matches
\w+ one or more word characters
|[^\w\s] or one character, that is neither a word character nor a whitespace
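Run against the sample string from the question, the pattern can be checked like this (a small sketch; only the variable name tokens is new):

```python
import re

string = "the quick brown(fox).jumps(over + the) = lazy[dog];"

# \w+      -> identifiers and numbers as whole tokens
# [^\w\s]  -> every other non-whitespace character as its own token
tokens = re.findall(r"\w+|[^\w\s]", string)
print(tokens)
```

Every symbol becomes a one-character token, so multi-character operators such as == or += would need their own alternatives placed before the [^\w\s] branch.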

python regular expressions -- split on non-word char or consecutive dashes, but not on single dash

I wish to split a sentence into a list of words on non-word characters (excluding dash, which likely means a hyphen) and consecutive dashes. What I mean is: "merry-go-round" is one word, not three words; "condition--but" is two words: remove the consecutive dashes.
I tried the following and it doesn't work:
listofwords = [word for word in re.split('[^a-zA-Z0-9]|-{2,}',sentence)]
I can provide a sample sentence:
sentence = 'sample sentence---such as well-being {\t'
and the desired result is ['sample', 'sentence', 'such', 'as', 'well-being'].
You can use this regex:
\w+(?:-\w+)*
Code:
import re
# One or more word characters, optionally followed by single-hyphen-joined parts
p = re.compile(r'\w+(?:-\w+)*')
test_str = "sample sentence---such as well-being { "
print(p.findall(test_str))
Output:
['sample', 'sentence', 'such', 'as', 'well-being']
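One detail worth knowing: \w also matches the underscore, while the attempt in the question used [^a-zA-Z0-9]. The sketch below is a variant under that assumption (letters and digits only); the names WORD and split_words are hypothetical.

```python
import re

# Assumption: only ASCII letters and digits count as word characters,
# mirroring the [^a-zA-Z0-9] class from the question. A single hyphen
# joins two parts; two or more consecutive hyphens act as a separator.
WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def split_words(sentence):
    """Return words, keeping single-hyphenated compounds together."""
    return WORD.findall(sentence)

print(split_words('sample sentence---such as well-being {\t'))
print(split_words("merry-go-round condition--but"))
```

With this class, an identifier like snake_case is split at the underscore, whereas the \w-based pattern keeps it whole.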

python split a text file function

I wrote a tokenize function that basically reads a string representation and splits it into list of words.
My code:
import re
def tokenize(document):
    x = document.lower()
    return re.findall(r'\w+', x)
My output:
tokenize("Hi there. What's going on? first-class")
['hi', 'there', 'what', 's', 'going', 'on', 'first', 'class']
Desired Output:
['hi', 'there', "what's", 'going', 'on', 'first-class']
Basically I want words with apostrophes and hyphenated words to remain single words in the list, along with double quotes. How can I change my function to get the desired output?
\w+ matches one or more word characters; this does not include apostrophes or hyphens.
You need to use a character set here to tell Python exactly what you want to match:
>>> import re
>>> def tokenize(document):
...     return re.findall("[A-Za-z'-]+", document)
...
>>> tokenize("Hi there. What's going on? first-class")
['Hi', 'there', "What's", 'going', 'on', 'first-class']
>>>
You'll notice too that I removed the x = document.lower() line. Adding A-Z to the character set lets the pattern match uppercase characters, but the tokens then keep their original case; if you want the all-lowercase output from the question, pass document.lower() to re.findall instead.
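A slightly stricter sketch, under my own assumptions: lowercase first, and only allow an apostrophe or hyphen between letters. The plain character set "[A-Za-z'-]+" also accepts stray runs such as "--" or "'n'" as tokens, which this variant avoids. The name WORD is hypothetical.

```python
import re

# Letters, optionally followed by groups of (apostrophe or hyphen) + letters,
# so the joining characters can never start or end a token.
WORD = re.compile(r"[a-z]+(?:['-][a-z]+)*")

def tokenize(document):
    """Lowercase the document and return its words."""
    return WORD.findall(document.lower())

print(tokenize("Hi there. What's going on? first-class"))
print(tokenize("rock 'n' roll -- yes"))
```

The trade-off is that leading or trailing apostrophes (as in 'tis or dogs') are dropped from the token.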
