Regex : catch one character but not more - python

I'm trying to find a regexp that catches all instances that contain one and only one \n and any number of (space), in a string that might also contain instances with multiple \n. So, for instance (with spaces denoted with _):
Should be caught:
\n
_\n
\n_
_\n_
Should *not* be caught, not even the first \n:
_
___
\n\n\n\n
\n\n\n_\n\n
_\n\n
\n\n_
_\n\n_
_\n\n_\n
\n_\n_
_\n_\n
_\n\n_\n_
___\n__\n and so on...
(Using re in Python 3 on Windows 10)
Edit to clarify the context: I'm parsing the text of a web page and I have a block of text in a string, that looks like that:
Word word word. Word word word word word. \n Word word word word word word. Word word word word. \n\n \nWord word word word word. \nWord word word. Word word word.
In the subsequent steps of my code, I'm using a function that gets rid of any \n, so I want to detect where they are before using this function, so I can keep them (by replacing them temporarily with special characters that won't disappear). But as you can see, I have two cases :
1) Multiple \n indicate a break of paragraphs, but I have no way to be sure that they follow each other without spaces or tabs between them. I want to catch them to replace them with a special character (like § for instance) that will let me know later where to put back multiple \n. It only matters that I know there are 2 or more \n, not how many there are. At the moment, I'm using this (but please do tell me if there is a bug):
text = re.sub(r"[ \t]*(?:\n[ \t]*){2,}", "$", text)
2) Single \n indicate a line break within a paragraph. These are what I want to single out, without catching the instances of the previous case. Again, it's to replace them with a special character (say |) to put it back later:
text = re.sub(r" the_regex_I'm_looking_for ", "|", text)
(I know I could do the first replacement, and then search for the remaining \n, but for reasons that would be largely irrelevant here and long to explain, I can't.)
2nd edit: So, for instance, the desired result in this case would be:
Word word word. Word word word word word. | Word word word word word word. Word word word word. $ Word word word word word. | Word word word. Word word word.
(I'd rather have no spaces before and after the § and the |, but here I'm forced to put them for the bold formatting of StackOverflow, if I don't I get something like **$**that.)

Would the following pattern suit you?
import regex as re
StrVal = r'Word word word. Word word word word word. \n Word word word word word word. Word word word word. \n\n \nWord word word word word. \nWord word word. Word word word.'
StrVal = re.sub(r'(?<!\\n\s*)\s*\\n\s*(?!\s*\\n)', '|', StrVal)
print(StrVal)
Returns:
Word word word. Word word word word word.|Word word word word word word. Word word word word. \n\n \nWord word word word word.|Word word word. Word word word.
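Instead of the re module, I used the third-party regex module, which allows a variable-width quantifier inside the negative lookbehind; re does not allow that. As a result, runs such as \n \n\n \n get no substitution. Note that StrVal is a raw string, so each \n in it is a literal backslash followed by n, which is why the pattern matches \\n.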
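If installing the third-party regex module is not an option, one possible sketch with the standard re module handles both cases in a single pass: match every run of newlines together with the surrounding spaces or tabs, then let a replacement callback count the newlines in the run. The function name mark_breaks and its placeholder defaults are illustrative and not from the question, and the sample text contains real newline characters rather than the literal backslash-n of the raw string above.

```python
import re

# A run of one or more newlines, plus any spaces/tabs around and between them.
BREAK_RUN = re.compile(r"[ \t]*(?:\n[ \t]*)+")

def mark_breaks(text, single="|", multiple="$"):
    """Replace runs holding exactly one newline with `single`,
    and runs holding two or more newlines with `multiple`."""
    def choose(match):
        return single if match.group().count("\n") == 1 else multiple
    return BREAK_RUN.sub(choose, text)

sample = "Word word word. \n Word word. \n\n \nWord word. \nWord word."
marked = mark_breaks(sample)
print(marked)
```

Because the whole run is consumed as one match, no lookbehind is needed to tell a single line break from a paragraph break.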

Check whether this demo works for you. I have used a space instead of "_".
import re
pattern = '^ *\n *$'
test_string = "\n\n "
result = re.findall(pattern, test_string)
print(result)
NB: I first tried '^\s*\n\s*', but that does not work because \s is equivalent to [ \t\n\r\f\v] and therefore also matches \n itself. So I used the literal space character ' ' instead.
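One caveat with the pattern above: in Python, $ also matches just before a newline at the very end of the string, so '^ *\n *$' still finds a match in "\n\n". A small sketch using re.fullmatch, which has to consume the entire string, avoids that. The helper name is_single_break is made up for illustration.

```python
import re

# Exactly one newline, with optional spaces on either side.
SINGLE_BREAK = re.compile(r" *\n *")

def is_single_break(chunk):
    """True only if the whole chunk is one newline plus optional spaces."""
    return SINGLE_BREAK.fullmatch(chunk) is not None

should_match = ["\n", " \n", "\n ", " \n "]
should_not_match = [" ", "   ", "\n\n", " \n\n", "\n\n ", " \n \n", "\n \n "]

for chunk in should_match + should_not_match:
    print(repr(chunk), is_single_break(chunk))

# The '$' anchor accepts a position just before a trailing newline:
dollar_hits = re.findall(r"^ *\n *$", "\n\n")
print(dollar_hits)
```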

Related

Regex to Identify Fixed character alphanumeric word from text in python

I have a text file from which i am trying to remove alpha-numeric word of seven characters.
Text1: " I have to remove the following word, **WORD123**, from the text given"
Text2: " I have to remove the following word, **WORD001**, the text given"
So far I tried '\b[A-Za-z0-9]\b' but it doesn't work.
Also, can we add functionality so that it picks only those words which are followed by "from" (not the actual word, just an example)? In the above example it should pick only WORD123, and not WORD001, as the latter is not followed by "from".
You may use re.sub here, e.g.
import re
inp = "I have to remove the following word, WORD123, FROM the text given"
out = re.sub(r'\s*\b[A-Za-z0-9]{7}\b[^\w]*(?=\bfrom)', '', inp, flags=re.IGNORECASE)
print(out)
This prints:
I have to remove the following word,FROM the text given
Note that the above regex replacement does not match/affect the second sample input sentence you gave, since the 7-character word there is not followed by the keyword from.
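As a rough sketch of the same idea applied to both sample sentences: the variant below replaces the match with a single space instead of an empty string, so the words around the removed token stay separated, and it adds a trailing \b after from. Both tweaks, and the name drop_word_before_from, are assumptions for illustration rather than part of the answer above.

```python
import re

# A 7-character alphanumeric word, plus surrounding separators,
# but only when the next word is "from" (case-insensitive).
SEVEN_BEFORE_FROM = re.compile(
    r"\s*\b[A-Za-z0-9]{7}\b\W*(?=\bfrom\b)", flags=re.IGNORECASE
)

def drop_word_before_from(text):
    """Remove the 7-character word preceding 'from', leaving one space."""
    return SEVEN_BEFORE_FROM.sub(" ", text)

text1 = "I have to remove the following word, WORD123, from the text given"
text2 = "I have to remove the following word, WORD001, the text given"

print(drop_word_before_from(text1))
print(drop_word_before_from(text2))
```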

using regular expression for searching multiple key words from cells

I have to write code that searches with a regular expression in an Excel sheet whose cells contain sentences grouped together. I have managed to find the key words representing each sentence. When I run the code below, it finds only one key word per cell and then moves to the next cell. I have tried to display the requirement in the table
\bphrase\W+(?:\w+\W+){0,6}?one\b|\bphrase\W+(?:\w+\W+){0,6}?two\b|\bphrase\W+(?:\w+\W+){0,6}?three\b|\bphrase\W+(?:\w+\W+){0,6}?four\b|
The regex:
\b(phrase)\b\W+(?:\w+\W+){0,6}?\b(one|two|three|four)\b
\b(phrase)\b matches phrase on a word boundary.
\W+: matches one or more non-word characters (typically spaces).
(?:\w+\W+){0,6}? Matches between 0 and 6 times, as few times as possible, one or more word characters followed by one or more non-word characters.
\b(one|two|three|four)\b Matches one, two, three or four on a word boundary.
The code:
import re
text = "This sentence has phrase one and phrase word word two and phrase word three and phrase four phrase too many words too many words too many words four again."
l = [m[1] + ' ' + m[2] for m in re.finditer(r'\b(phrase)\b\W+(?:\w+\W+){0,6}?\b(one|two|three|four)\b', text)]
print(l)
Prints:
['phrase one', 'phrase two', 'phrase three', 'phrase four']
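Since the question is about finding every key word in each cell, here is a minimal sketch that applies the same pattern cell by cell. The list of strings stands in for cell values already read from the sheet; how they are loaded, and the name keywords_per_cell, are assumptions for illustration.

```python
import re

PHRASE_KEYWORD = re.compile(
    r"\b(phrase)\b\W+(?:\w+\W+){0,6}?\b(one|two|three|four)\b"
)

def keywords_per_cell(cells):
    """Return, for each cell, every 'phrase <keyword>' pair found in it."""
    return [
        [f"{m[1]} {m[2]}" for m in PHRASE_KEYWORD.finditer(cell)]
        for cell in cells
    ]

# Hypothetical cell contents.
cells = [
    "This cell has phrase one and later phrase word word two.",
    "Nothing relevant here.",
    "Here phrase three comes first, then phrase word four.",
]
found = keywords_per_cell(cells)
print(found)
```

Using finditer rather than a single search is what lets one cell yield several key words.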

Python re adding space when splitting

I am using the following regex to split a phrase passed in as a string, into a list of words.
Because there might be other letters, I'm using the UTF flag. This works great in most cases:
phrase = 'hey look out'
word_list = re.split(r'[\W_]+', unicode(phrase, 'utf-8').lower(), flags=re.U)
word_list [u'hey', u'look', u'out']
But, if the phrase is a sentence that ends with a period like this, it will create a blank value in the list:
phrase = 'hey, my spacebar_is broken.'
word_list [u'hey', u'my', u'spacebar', u'is', u'broken', u'']
My work around is to use
re.split(r'[\W_]+', unicode(phrase.strip('.'), 'utf-8').lower(), flags=re.U)
but I wanted to know if there was a way to solve it within the regex expression?
\W selects non-word characters. Since . is a non-word character, the string is split on it. Since there is nothing after the period, you get an empty string. If you want to avoid this, you'll need to either strip the separator characters off the ends of the string
phrase = re.sub(r'^[\W_]+|[\W_]+$', '', phrase)
or filter the resulting array to remove empty strings.
word_list = [word for word in word_list if word]
Alternatively, you can get the words by matching them directly rather than splitting:
words = re.findall(r'[^\W_]+', phrase)
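The three options can be compared side by side. This is a small sketch for Python 3, where str is already Unicode and re patterns are Unicode-aware by default, so the unicode() call and the re.U flag from the question are not needed; the variable names are illustrative.

```python
import re

phrase = "hey, my spacebar_is broken."

# Splitting leaves an empty string after the trailing period.
split_words = re.split(r"[\W_]+", phrase.lower())

# Option 1: strip separators from both ends before splitting.
trimmed = re.sub(r"^[\W_]+|[\W_]+$", "", phrase.lower())
split_trimmed = re.split(r"[\W_]+", trimmed)

# Option 2: filter out empty strings afterwards.
filtered = [word for word in split_words if word]

# Option 3: match the words directly instead of splitting.
found_words = re.findall(r"[^\W_]+", phrase.lower())

print(split_words)
print(split_trimmed, filtered, found_words)
```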

Matching an apostrophe only within a word or string

I'm looking for a Python regex that can match 'didn't' and returns only the character that is immediately preceded by an apostrophe, like 't, but not the 'd or t' at the beginning and end.
I have tried (?=.*\w)^(\w|')+$ but it only matches the apostrophe at the beginning.
Some more examples:
'I'm' should only match 'm and not 'I
'Erick's' should only return 's and not 'E
The text will always start and end with an apostrophe and can include apostrophes within the text.
To match an apostrophe inside a whole string, match it anywhere but at the start/end of the string:
(?!^)'(?!$)
See the regex demo.
Often, the apostrophe is searched for only inside a word (in fact, a pair of words where the second one is shortened); in that case you may use
\b'\b
See this regex demo. Here, the ' is preceded and followed by a word boundary, so the ' must have a word character (a letter, digit or _) on each side. Yes, the _ char and digits are allowed on both sides.
If you need to match a ' only between two letters, use
(?<=[A-Za-z])'(?=[A-Za-z]) # ASCII only
(?<=[^\W\d_])'(?=[^\W\d_]) # Any Unicode letters
See this regex demo.
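To see how the three patterns differ, a short sketch can list the positions each one matches in two sample strings. The helper apostrophe_positions and the second sample string are made up for illustration.

```python
import re

PATTERNS = {
    "not at string edges": r"(?!^)'(?!$)",
    "between word chars": r"\b'\b",
    "between letters": r"(?<=[^\W\d_])'(?=[^\W\d_])",
}

def apostrophe_positions(pattern, text):
    """Return the index of every apostrophe matched by the pattern."""
    return [m.start() for m in re.finditer(pattern, text)]

quoted = "'didn't'"   # quote-wrapped input from the question
mixed = "5'6 isn't"   # digits around the first apostrophe

for name, pattern in PATTERNS.items():
    print(name, apostrophe_positions(pattern, quoted),
          apostrophe_positions(pattern, mixed))
```

Only the letter-based lookarounds skip the apostrophe between the digits in the second string.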
As for this current question, here is a bunch of possible solutions:
import re
s = "'didn't'"
print(s.strip("'")[s.strip("'").find("'")+1])
print(re.search(r'\b\'(\w)', s).group(1))
print(re.search(r'\b\'([^\W\d_])', s).group(1))
print(re.search(r'\b\'([a-z])', s, flags=re.I).group(1))
print(re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I))
The s.strip("'")[s.strip("'").find("'")+1] gets the character after the first ' after stripping the leading/trailing apostrophes.
The re.search(r'\b\'(\w)', s).group(1) solution gets the word (i.e. [a-zA-Z0-9_], can be adjusted from here) char after a ' that is preceded with a word char (due to the \b word boundary).
The re.search(r'\b\'([^\W\d_])', s).group(1) is almost identical to the above solution, it only fetches a letter character as [^\W\d_] matches any char other than a non-word, digit and _.
Note that the re.search(r'\b\'([a-z])', s, flags=re.I).group(1) solution is next to identical to the above one, but you cannot make it Unicode aware with re.UNICODE.
The last re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I) just shows how to fetch multiple letter chars from a string input.

Python regex keep a few more tokens

I am using the following regex in Python to keep words that do not contain non alphabetical characters:
(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
The problem is that this regex does not keep words that I would like to keep such as the following:
Company,
months.
third-party
In other words I would like to keep words that are followed by a comma, a dot, or have a dash between two words.
Any ideas on how to implement this?
I tried adding something like |(?<!\S)[A-Za-z]+(?=\.(?!\S)) for the dots but it does not seem to be working.
Thanks !
EDIT:
Should match these:
On-line
. These
maintenance,
other.
. Our
Google
Should NOT match these:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
NY7xtb92dCTfvEjdmkDrUw==
$As_Of_12_31_20104206http://www.sec.gov/CIK0001393311instant2010-12-31T00:00:000001-01-01T00:00:00falsefalseArlington/S.Cooper
-Publisher
gaap_RealEstateAndAccumulatedDepreciationCostsCapitalizedSubsequentToAcquisitionCarryingCostsus
At the moment I am using the following Python code to read a text file line by line:
find_words = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))').findall
Then I open the text file
contents = open("test.txt","r")
and I search for the words line by line
for line in contents:
    if find_words(line.lower()) != []:
        lineWords=find_words(line.lower())
        print "The words in this line are: ", lineWords
using some word lists in the following way:
wanted1 = set(find_words(open('word_list_1.csv').read().lower()))
wanted2 = set(find_words(open('word_list_2.csv').read().lower()))
negators = set(find_words(open('word_list_3.csv').read().lower()))
I first want to get the valid words from the .txt file, and then check if these words belong in the word lists. The two steps are independent.
I propose this regex:
find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall
There are 3 parts from your initial regex that I changed:
Middle part:
[A-Za-z]+(?:-[A-Za-z]+)*
This allows hyphenated words.
The last part:
(?=\W|$)
This is a bit similar to (?!\S), except that it also allows non-space characters such as punctuation after the word. The match succeeds if, after the word, the line ends or there is a non-word character, that is, anything other than a letter, digit or underscore (if you want word_ to match word, you will have to change \W to [^a-zA-Z0-9]).
The first part (probably most complex):
(?:(?<=[^\w./-])|(?<=^))
It is itself composed of two alternatives, (?<=[^\w./-]) and (?<=^). The second one allows a match when the word starts at the beginning of the line. We cannot use (?<=[^\w./-]|^) because a lookbehind in Python's re cannot be of variable width (with [^\w./-] having a length of 1 and ^ a length of 0).
(?<=[^\w./-]) allows a match if the character immediately before the word is not a word character, period, forward slash or hyphen.
When broken down, the small parts are rather straightforward I think, but if you would like more elaboration on anything, I can give more details.
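A quick way to check the proposed pattern is to run it against a few of the samples from the question. This is only a sketch: the input strings are taken or shortened from the question's lists, and the variable names are illustrative.

```python
import re

find_words = re.compile(
    r"(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)"
).findall

# Words the question wants to keep.
kept = find_words("Company, months. third-party On-line Google")

# Tokens the question wants to reject.
unwanted = ["-Publisher", "NY7xtb92dCTfvEjdmkDrUw==", "invalid42"]
rejected = [find_words(token) for token in unwanted]

print(kept)
print(rejected)
```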
This is not a regex task, because you cannot detect words with a regex alone. You need a dictionary to check your words against.
So I suggest using a regex to split your string on non-alphabetical characters and checking whether all of the items exist in your dictionary. For example:
import re
# my_string is your input text and my_dict your collection of known words
words = re.split(r'[^A-Za-z]+', my_string)
print(all(i in my_dict for i in words if i))
As an alternative, you can use nltk.corpus as your dictionary:
from nltk.corpus import wordnet
words = re.split(r'[^A-Za-z]+', my_string)
if all(wordnet.synsets(i) for i in words if i):
    pass  # do stuff
But if you want to use your own word list, you need to change your regex, because it is incorrect. Instead, use re.split as above:
all_words = wanted1|wanted2|negators
with open("test.txt","r") as f:
    for line in f:
        for word in line.split():
            # split each token on non-alphabetical characters
            words = re.split(r'[^A-Za-z]+', word)
            if all(i in all_words for i in words if i):
                print(word)
Instead of using all sorts of complicated look-arounds, you can use \b to detect the boundary of words. This way, you can use e.g. \b[a-zA-Z]+(?:-[a-zA-Z]+)*\b
Example:
>>> p = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b"
>>> text = "This is some example text, with some multi-hyphen-words and invalid42 words in it."
>>> re.findall(p, text)
['This', 'is', 'some', 'example', 'text', 'with', 'some', 'multi-hyphen-words', 'and', 'words', 'in', 'it']
Update: Seems like this does not work too well, as it also detects fragments from URLs, e.g. www, sec and gov from http://www.sec.gov.
Instead, you might try this variant, using look-around explicitly stating the 'legal' characters:
r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
This seems to pass all your test-cases.
Let's dissect this regex:
(?<![^\s("]) - look-behind asserting that the word is preceded by space, quote or parens, but e.g. not a number (using double-negation instead of positive look-behind so the first word is matched, too)
[a-zA-Z]+ - the first part of the word
(?:[-'][a-zA-Z]+)* - optionally more word-segments after a ' or -
(?=[\s.,:;!?")]) - look-ahead asserting that the word is followed by space, punctuation, quote or parens
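One thing to keep in mind with this pattern is that the look-ahead needs an actual character after the word, so a word at the very end of a string with no trailing newline is skipped. The sketch below compares it with a variant that also accepts the end of the string; the |$ alternative and the names used here are an assumption added for illustration, not part of the answer above.

```python
import re

LOOKAHEAD_ONLY = re.compile(
    r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
)
# Same pattern, but the look-ahead also accepts the end of the string.
WITH_END = re.compile(
    r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")]|$)"""
)

line = "On-line maintenance, other. -Publisher invalid42 Google"

without_newline = LOOKAHEAD_ONLY.findall(line)
with_newline = LOOKAHEAD_ONLY.findall(line + "\n")
with_end_anchor = WITH_END.findall(line)

print(without_newline)
print(with_newline)
print(with_end_anchor)
```

When the file is read line by line, each line normally still ends with a newline, so the original pattern already sees a whitespace character after the last word.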
