Regex parsing text and get relevant words / characters - python

I want to parse a file that contains source code in some programming language, and I want to get a list of all its symbols (words, operators, brackets, etc.).
I tried a few patterns and decided that this is the most successful so far:
pattern = r"\b(\w+|\W+)\b"
Using this on my text, which is something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
will give roughly the output I need, but it contains some characters I don't want and some unwanted grouping:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', '(', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of, and some double symbols, like )., that I would like to have as single characters. Of course I have to modify the \W+ to get this done, but I need a little help.
The other problem is that my regex doesn't match the trailing ];, which I also need.

Why use \W+ (one or more) if you want single non-word characters in the output? Additionally, exclude whitespace by using a negated class. It also seems like you can drop the word boundaries.
re.findall(r"\w+|[^\w\s]", string)
This matches
\w+ one or more word characters
|[^\w\s] or one character, that is neither a word character nor a whitespace
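As a quick check, here is a minimal sketch that runs the suggested pattern on the sample string from the question (the variable names text and tokens are only illustrative):

```python
import re

# Sample input taken from the question.
text = "the quick brown(fox).jumps(over + the) = lazy[dog];"

# \w+ keeps whole words together; [^\w\s] emits every other
# non-whitespace character as a token of its own.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['the', 'quick', 'brown', '(', 'fox', ')', '.', 'jumps', '(', 'over',
#  '+', 'the', ')', '=', 'lazy', '[', 'dog', ']', ';']
```

Whitespace is dropped, ). comes out as two tokens, and the trailing ]; is matched as ] and ;.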

Related

How to remove every special character except for the hyphen and the apostrophe inside and between words?

As an example, I already managed to break the sentence
"That's a- tasty tic-tac. Or -not?" into a list of words like this:
words = ["That's", 'a-', 'tasty', 'tic-tac.', 'Or', '-not?']
Now I have to remove every special character I don't need and get this: words = ["That's", 'a', 'tasty', 'tic-tac', 'Or', 'not']
My current code looks like this:
pattern = re.compile(r'[\W_]+')
for x in range(len(file_text)):
    for y in range(len(file_text[x])):
        word_list.append(pattern.sub('', file_text[x][y]))
I have a whole text that I first split into lines of words and then flatten into a single list of words.
You can use
r"\b([-'])\b|[\W_]"
See the regex demo (the demo is a bit modified so that [\W_] could not match newlines as the input at the demo site is a single multiline string).
Regex details
\b([-'])\b - a - or ' that is enclosed by word chars (letters, digits or underscores). NOTE: to keep these symbols only when they are enclosed by letters, use (?<=[^\W\d_])([-'])(?=[^\W\d_]) instead
| - or
[\W_] - any char other than a letter or a digit.
See the Python demo:
import re
words = ["That's", 'a-', 'tasty', 'tic-tac.','Or', '-not?']
rx = re.compile(r"\b([-'])\b|[\W_]")
print( [rx.sub(r'\1', x) for x in words] )
# => ["That's", 'a', 'tasty', 'tic-tac', 'Or', 'not']
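To make the difference between the two variants from the note concrete, here is a small sketch comparing them. The extra input '1-2' and the names rx_word and rx_letter are assumptions added for illustration; it relies on Python 3.5+, where an unmatched group in the replacement is substituted with an empty string.

```python
import re

# '1-2' is a hypothetical extra input: a hyphen between digits.
words = ["That's", 'a-', 'tic-tac.', '1-2']

# Keeps - or ' when enclosed by word chars (letters, digits, underscores).
rx_word = re.compile(r"\b([-'])\b|[\W_]")
# Stricter variant: keeps - or ' only when enclosed by letters.
rx_letter = re.compile(r"(?<=[^\W\d_])([-'])(?=[^\W\d_])|[\W_]")

by_word = [rx_word.sub(r'\1', w) for w in words]
by_letter = [rx_letter.sub(r'\1', w) for w in words]

print(by_word)    # ["That's", 'a', 'tic-tac', '1-2']
print(by_letter)  # ["That's", 'a', 'tic-tac', '12']
```

Only the hyphen between digits is treated differently: the \b version keeps it, the letter-only version removes it.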

Match words that don't start with a certain letter using regex

I am learning regex but have not been able to find the right regex in Python for selecting words that do not start with a particular letter.
Example below:
text='this is a test'
match=re.findall('(?!t)\w*',text)
# match returns
['his', '', 'is', '', 'a', '', 'est', '']
match=re.findall('[^t]\w+',text)
# match
['his', ' is', ' a', ' test']
Expected : ['is','a']
With regex
Use the negated set [^\Wt] to match any word character (letter, digit or underscore) that is not t. To avoid matching parts of words, add the word boundary metacharacter, \b, at the beginning of your pattern.
Also, do not forget that you should use raw strings for regex patterns.
import re
text = 'this is a test'
match = re.findall(r'\b[^\Wt]\w*', text)
print(match) # prints: ['is', 'a']
Without regex
Note that this is also achievable without regex.
text = 'this is a test'
match = [word for word in text.split() if not word.startswith('t')]
print(match) # prints: ['is', 'a']
You are almost on the right track. You just forgot \b (word boundary) token:
\b(?!t)\w+
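A short sketch of the lookahead pattern in use, next to the negated-set pattern for comparison. The longer sample sentence is an assumption, chosen so that some words contain t without starting with it:

```python
import re

# Hypothetical sample: 'not' and 'at' contain t but do not start with it.
text = 'this is a test, not toast at all'

# \b anchors at the start of a word; (?!t) rejects words whose first letter is t.
with_lookahead = re.findall(r'\b(?!t)\w+', text)
# Equivalent idea with a negated set for the first character.
with_negated_set = re.findall(r'\b[^\Wt]\w*', text)

print(with_lookahead)    # ['is', 'a', 'not', 'at', 'all']
print(with_negated_set)  # ['is', 'a', 'not', 'at', 'all']
```

Without \b, the lookahead would also succeed in the middle of a word, which is why the original attempt returned fragments such as 'his' and 'est'.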

How to split sentence to words with regular expression?

"She's so nice!" -> ["she","'","s","so","nice","!"]
I want to split a sentence like this.
I wrote the code below, but its result includes whitespace.
How can I do this using only a regular expression?
words = re.findall(r'\W+|\w+', "She's so nice!")
-> ["She", "'", "s", " ", "so", " ", "nice", "!"]
words = [word for word in words if not word.isspace()]
Regex: [A-Za-z]+|[^A-Za-z ]
In [^A-Za-z ] add chars you don't want to match.
Details:
[] Match a single character present in the list
[^] Match a single character NOT present in the list
+ Matches between one and unlimited times
| Or
Python code:
import re
text = "She's so nice!"
matches = re.findall(r'[A-Za-z]+|[^A-Za-z ]', text)
Output:
['She', "'", 's', 'so', 'nice', '!']
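One caveat worth checking: the character classes above only know about ASCII letters and the space character. The sketch below uses a hypothetical input with digits and a tab to show the effect, and compares it with \w+|[^\w\s], which is one possible alternative rather than part of the answer above:

```python
import re

# Hypothetical input containing digits and a tab character.
text = "Room 42\tis nice!"

# ASCII-letter classes: digits come out one by one and the tab is kept.
ascii_tokens = re.findall(r"[A-Za-z]+|[^A-Za-z ]", text)
# Word / non-whitespace classes: digits stay together, all whitespace is dropped.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

print(ascii_tokens)  # ['Room', '4', '2', '\t', 'is', 'nice', '!']
print(word_tokens)   # ['Room', '42', 'is', 'nice', '!']
```

For plain sentences such as "She's so nice!" both patterns give the same result.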
In Python versions before 3.7, the re module does not split on zero-width matches. You can use the regex package from PyPI instead (making sure you specify version 1 behaviour, which properly handles zero-width matches).
import regex
s = "She's so nice!"
x = regex.split(r"\s+|\b(?!^|$)", s, flags=regex.VERSION1)
print(x)
Output: ['She', "'", 's', 'so', 'nice', '!']
\s+|\b(?!^|$) Match either of the following options
\s+ Match one or more whitespace characters
\b(?!^|$) Assert position as a word boundary, but not at the beginning or end of the line
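For comparison, here is a sketch that stays within the standard library. It assumes Python 3.7 or later, where re.split also splits on empty matches, and it uses a simpler pattern than the one above, filtering out the empty strings afterwards:

```python
import re

s = "She's so nice!"

# Assumes Python 3.7+: re.split splits on the zero-width \b as well.
# Splitting at string edges or right after whitespace yields empty
# strings, so those are filtered out.
parts = [p for p in re.split(r"\s+|\b", s) if p]
print(parts)  # ['She', "'", 's', 'so', 'nice', '!']
```

Unlike the findall-based answers, this keeps consecutive punctuation together, since no word boundary occurs between two non-word characters.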

Python regular expression split with \W

In the Python documentation, I came across the following code snippet:
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
What confuses me is that \W matches any character which is not a Unicode word character, but ',' is a Unicode character. And what do the parentheses mean? I know they define a group, but there is only one group in the pattern. Why is ', ' also returned?
A "Unicode word character" is a character that can be part of a word: basically a letter, a digit or an underscore. \W matches any character that is not one of these.
A comma cannot be part of a word, so \W matches it.
And the comma is included in the resulting list because the split regex is in parentheses (defining a capturing group inside the split regex). That's how re.split works: when the pattern contains a capturing group, the text matched by the group is returned as well. That's the difference between your two code snippets.
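The effect of the group can be seen side by side in a small sketch. The non-capturing variant (?:...) is added here only to show that it is the capturing that matters, not the parentheses as such:

```python
import re

text = 'Words, words, words.'

# No group: the separators are discarded.
without_group = re.split(r'\W+', text)
# Capturing group: the separators are returned between the pieces.
with_group = re.split(r'(\W+)', text)
# Non-capturing group: behaves like the pattern without a group.
non_capturing = re.split(r'(?:\W+)', text)

print(without_group)  # ['Words', 'words', 'words', '']
print(with_group)     # ['Words', ', ', 'words', ', ', 'words', '.', '']
print(non_capturing)  # ['Words', 'words', 'words', '']
```

The trailing empty string appears because the text ends with a separator, so there is nothing after the final split point.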

Stripping punctuation from unique strings in an input file

This question ( Best way to strip punctuation from a string in Python ) deals with stripping punctuation from an individual string. However, I'm hoping to read text from an input file, but only print out ONE COPY of all strings without ending punctuation. I have started something like this:
f = open('#file name ...', 'a+')
for x in set(f.read().split()):
    print(x)
But the problem is that if the input file has, for instance, this line:
This is not is, clearly is: weird
It treats the three different cases of "is" differently, but I want to ignore any punctuation and have it print "is" only once, rather than three times. How do I remove any kind of ending punctuation and then put the resulting string in the set?
Thanks for any help. (I am really new to Python.)
The following should be better able to distinguish words correctly:
import re
for x in set(re.findall(r'\b\w+\b', f.read())):
    print(x)
This regular expression finds contiguous runs of word characters (a-z, A-Z, 0-9, _).
If you want to find letters only (no digits and no underscore), then replace the \w with [a-zA-Z].
>>> re.findall(r'\b\w+\b', "This is not is, clearly is: weird")
['This', 'is', 'not', 'is', 'clearly', 'is', 'weird']
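Putting the pieces together, a minimal self-contained sketch might look like this. The helper name unique_words is hypothetical, and io.StringIO stands in for the opened input file so that the example runs without touching the disk:

```python
import io
import re

def unique_words(stream):
    """Return the set of distinct words read from a file-like object."""
    # \b\w+\b picks out runs of word characters, so trailing
    # punctuation such as ',' or ':' never becomes part of a word.
    return set(re.findall(r'\b\w+\b', stream.read()))

# Stand-in for the real input file (an assumption for this example).
sample = io.StringIO("This is not is, clearly is: weird\n")
words = unique_words(sample)

# Sets are unordered, so sort for a stable display.
print(sorted(words))  # ['This', 'clearly', 'is', 'not', 'weird']
```

With a real file, the same function can be called on the object returned by open() in read mode.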
You can use translation tables if you don't mind replacing your punctuation characters with whitespace, e.g. (Python 3; both arguments of str.maketrans must have the same length):
>>> punctuation = ",;.:"
>>> replacement = " " * len(punctuation)
>>> trans_table = str.maketrans(punctuation, replacement)
>>> 'This is not is, clearly is: weird'.translate(trans_table)
'This is not is  clearly is  weird'
# And for your case of creating a set of unique words (sorted here for a stable display):
>>> sorted(set('This is not is  clearly is  weird'.split()))
['This', 'clearly', 'is', 'not', 'weird']
