Python Regex for End of Line

I am trying to write a regex that adds a space before and after a dot, but only when the dot is followed by a space or the end of the line.
I cannot get it to work for the end-of-line case.
E.g.:
I want a hotel. >> I want a hotel .
my email is zob#gmail.com >> my email is zob#gmail.com
I have to play. bye! >> I have to play . bye!
Following is my code:
# If "Dot and space" after word or number put space before and after
utterance = re.sub(r'(?<=[a-z0-9])[.][ $]',' . ',utterance)
How do I correct my regex so that my first example above also works? I tried putting a $ sign inside the square brackets, but that doesn't work.

The main issue is that $ inside a character class denotes a literal $ symbol; you need a grouping construct here instead.
I suggest using the following code:
import re
regex = r"([^\W_])\.(?:\s+|$)"
ss = ["I want a hotel.","my email is zob#gmail.com", "I have to play. bye!"]
for s in ss:
    result = re.sub(regex, r"\1 . ", s).rstrip()
    print(result)
See the Python demo.
If you need to apply this on lines only without affecting line breaks, you can use
import re
regex = r"([^\W_])\.(?:[^\S\n\r]+|$)"
text = "I want a hotel.\nmy email is zob#gmail.com\nI have to play. bye!"
print( re.sub(regex, r"\1 . ", text, flags=re.M).rstrip() )
See this Python demo.
Output:
I want a hotel .
my email is zob#gmail.com
I have to play . bye!
Details:
([^\W_]) - Group 1 matching any letter or digit
\. - a literal dot
(?:\s+|$) - a grouping matching either 1+ whitespaces or end of string anchor (here, $ matches the end of string.)
The rstrip will remove the trailing space added during replacement.
If you are using Python 3, the [^\W_] will match all Unicode letters and digits by default. In Python 2, re.U flag will enable this behavior.
Note that \s+ in the last (?:\s+|$) will "shrink" multiple whitespaces into 1 space.

Use a lookahead assertion (?=...) to find a . followed by a space or a newline \n (a dot at the very end of a string that has no trailing newline is not covered by this pattern):
utterance = re.sub('\\.(?= )|\\.(?=\n)', ' . ', utterance )

[ $] defines a class of characters consisting of a space and a dollar sign, so it matches a space or a literal dollar sign. To match a space or the end of the line, use ( |$) (in this case, $ keeps its special meaning).
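The difference between the character class and the group can be checked with a small sketch. The sample strings come from the question, except "cost.$", which is a made-up string added here only to show the literal-dollar behaviour.

```python
import re

# The attempt from the question: inside [...], $ is just a dollar character
char_class = re.compile(r"(?<=[a-z0-9])[.][ $]")
# The same idea with a group: here $ is the end-of-string anchor
grouped = re.compile(r"(?<=[a-z0-9])[.]( |$)")

samples = [
    "I want a hotel.",
    "I have to play. bye!",
    "my email is zob#gmail.com",
    "cost.$",  # hypothetical input, not from the question
]

for s in samples:
    print(repr(char_class.sub(" . ", s)), "|", repr(grouped.sub(" . ", s)))
```

Only the grouped version changes "I want a hotel.", and it leaves a trailing space that you may want to strip afterwards.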

Related

how can I perform conditional splitting with exceptions in python

I want to split a string into sentences.
But there are some exceptions that I did not expect:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
Desired split:
split = ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
How can I do this using regex in Python?
My efforts so far:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
split = re.split('(?<=[.|?|!|...])\s', str)
print(split)
I got:
['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE.', 'Name.', 'Text.']
Expected:
['UPPERCASE.UPPERCASE. Name.']
That is, the \s in [A-Z]+\. Name should not cause a split.

You can use
(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+
See the regex demo. Details:
(?<=[.?!]) - a positive lookbehind that requires ., ? or ! immediately to the left of the current location
(?<![A-Z]\.(?=\s+Name)) - a negative lookbehind that fails the match if there is an uppercase letter and a . followed with 1+ whitespaces + Name immediately to the left of the current location (note the + is used in the lookahead, that is why it works with Python re, and \s+ in the lookahead is necessary to check for the Name presence after whitespace that will be matched and consumed with the next \s+ pattern below)
\s+ - one or more whitespace chars.
See the Python demo:
import re
text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(re.split(r'(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+', text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']

regex capture info in text file after multiple blank lines

I open a complex text file in python, match everything else I need with regex but am stuck with one search.
I want to capture the numbers after the 'start after here' line. The blank line between the two rows is important, and I plan to split on it later.
start after here: test
5.7,-9.0,6.2
1.6,3.79,3.3
Code:
text = open(r"file.txt","r")
for line in text:
    find = re.findall(r"start after here:[\s]\D+.+", line)
I tried the pattern at https://regexr.com/ and it seems to work there, but that tool is for Java.
In Python it doesn't find anything. I assume this is because I need to incorporate multiline matching, but I am unsure how to read the file differently or how to incorporate it. I have tried many adjustments to the regex without success.

import re
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
m = re.search(r'start after here:([^\n])+\n+(.*)', test_str)
new_str = m[2]
m = re.search(r'(-?\d*\.\d*,?\s*)+', new_str)
print(m[0])

The pattern start after here:[\s]\D+.+ matches the literal words and then a whitespace char using [\s] (you can omit the brackets).
Then \D+ matches 1+ non-digit chars, which runs up to just before 5.7. Then .+ matches 1+ chars other than a newline, which matches 5.7,-9.0,6.2. It will not match the following empty line or the next line.
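A small sketch of that behaviour, comparing a line-by-line loop with a search over the whole text. Here io.StringIO is only a stand-in for the opened file (an assumption of this example). With a real file you would call read() on it to get the whole content as one string.

```python
import io
import re

pattern = re.compile(r"start after here:[\s]\D+.+")
content = "start after here: test\n\n\n5.7,-9.0,6.2\n\n1.6,3.79,3.3\n"

# Line by line: each call only ever sees one line, so the numbers
# can never be part of the same match as the marker text
per_line = [pattern.findall(line) for line in io.StringIO(content)]
print(per_line)

# Whole text at once: \D+ can run across the blank lines up to the first digit
whole = pattern.findall(content)
print(whole)
```

Even on the whole text, the match stops at the end of the first row of numbers, because . does not match a newline.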
One option could be to match your string and then capture, in a capturing group, all the following lines that start with a decimal.
\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*
The values including the empty line are in the first capturing group.
For example
import re
regex = r"\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*"
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
matches = re.findall(regex, test_str)
print(matches)
Result
['5.7,-9.0,6.2\n\n1.6,3.79,3.3']
Regex demo | Python demo
If you want to match the decimals (or just one or more digits) before the comma you might split on 1 or more newlines and use:
[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)
Regex demo
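As a sketch of that last step, assuming the captured block looks like the result shown earlier, you could split it on runs of newlines and collect the numbers of each row. The variable names here are illustrative only.

```python
import re

# The captured group from the earlier result
captured = "5.7,-9.0,6.2\n\n1.6,3.79,3.3"

number = re.compile(r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)")

# Split on 1 or more newlines, then pull the numbers out of each row
rows = [number.findall(row) for row in re.split(r"\n+", captured)]
print(rows)
# [['5.7', '-9.0', '6.2'], ['1.6', '3.79', '3.3']]
```

The values are still strings at this point; convert them with float() if you need numbers.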

How to match numeric characters with no white space following

I need to match lines in text document where the line starts with numbers and the numbers are followed by nothing.... I want to include numbers that have '.' and ',' separating them.
Currently, I have:
p = re.compile('\$?\s?[0-9]+')
for i, line in enumerate(letter):
    m = p.match(line)
    if m is not None:
        print(m)
        print(line)
Which gives me this:
"15,704" and "416" -> this is good, I want this
but also this:
"$40 million...." -> I do not want to match this line or any line where the numbers are followed by words.
I've tried:
p = re.compile('\$?\s?[0-9]+[ \t\n\r\f\v]')
But it doesn't work. One reason is that it turns out there is no white space after the numbers I'm trying to match.
Appreciate any tips or tricks.

If you want to match the whole string with a regex, you have 2 choices:
Either call re.fullmatch(pattern, string) (note full in the function name). It tries to match just the whole string.
Or put a $ anchor at the end of your regex and call re.match(pattern, string). It tries to find a match from the start of the string.
Actually, you could also add ^ at the start of the regex and call re.search(pattern, string), but it would be a very strange combination.
I also have a remark concerning how you specified your conditions, which may be incomplete: you gave e.g. the $40 million string and stated that the only reason to reject it is the space and letters after $40.
So actually you should have written that you want to match a string:
Possibly starting with $.
After the $ there can be a space (maybe, I'm not sure).
Then there can be a sequence of digits, dots or commas.
And nothing more.
One more remark concerning Python literals: apparently you have forgotten to prefix the pattern with r.
If you use an r-string literal, you do not have to double the backslashes inside.
So I think the most natural solution is to call the function devoted to matching the whole string (i.e. fullmatch), without adding start / end anchors, and the whole script can be:
import re
pat = re.compile(r'(?:\$\s?)?[\d,.]+')
lines = ["416", "15,704", "$40 million"]
for line in lines:
    if pat.fullmatch(line):
        print(line)
Details concerning the regex:
(?: - A non-capturing group.
\$ - Consisting of a $ char.
\s? - And optional space.
)? - End of the non-capturing group and ? stating that the whole group is optional.
[\d,.]+ - A sequence of digits, commas and dots (note that between [ and ] the dot represents itself, so no backslash quotation is needed).
If you would like to reject strings like 2...5 or 3.,44 (no consecutive
dots or commas allowed), change the last part of the above regex to:
[\d]+(?:[,.][\d]+)*
Details:
[\d]+ - A sequence of digits.
(?: - A non-capturing group.
[,.] - Either a comma or a dot (single).
[\d]+ - Another sequence of digits.
)* - End of the non-capturing group, it may occur several times.
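The three ways of anchoring mentioned in this answer can be compared side by side. This is only a sketch: the helper name checks is made up for the example, and the input with a trailing newline is an extra case added to show where $ and fullmatch differ.

```python
import re

pat = r"(?:\$\s?)?[\d,.]+"

def checks(line):
    """Return (fullmatch, match with $, search with ^ and $) as booleans."""
    return (
        bool(re.fullmatch(pat, line)),
        bool(re.match(pat + r"$", line)),
        bool(re.search(r"^" + pat + r"$", line)),
    )

print(checks("15,704"))        # accepted by all three
print(checks("$40 million"))   # rejected by all three
print(checks("15,704\n"))      # $ also matches just before a trailing newline

# Without any anchoring, match() is satisfied by the prefix alone
print(re.match(pat, "$40 million").group(0))
```

If the lines come straight from a file, they usually still end in a newline, so either strip them first or keep in mind that fullmatch will reject them.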

With a little modification to your code:
import re
letter = ["15,704", "$40 million"]
p = re.compile(r'^\d{1,3}([\.,]\d{3})*$') # Numbers separated by commas or points
for i, line in enumerate(letter):
    m = p.match(line)
    if m:
        print(line)
Output:
15,704

You could use the following regex:
import re
pattern = re.compile(r'^[0-9,.]+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^[0-9,.]+\s*$ matches a line made up only of digits, commas and dots, followed by zero or more whitespace chars. If you want to match only numbers with at most one , or ., use the following pattern: '^\d+[,.]?\d+\s*$'. Code:
import re
pattern = re.compile(r'^\d+[,.]?\d+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
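The pattern ^\d+[,.]?\d+\s*$ matches a line that starts with a group of digits (\d+), followed by an optional , or . ([,.]?), followed by another group of digits (\d+) and optional trailing whitespace (\s*). Note that the two \d+ groups mean it needs at least two digits, so a single-digit line such as 7 is not matched.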

Regex to extract top level domain from email address

From email address like
xxx#site.co.uk
xxx#site.uk
xxx#site.me.uk
I want to write a regex which should return 'uk' in all cases.
I have tried
'+#([^.]+)\..+'
which gives only the domain name. I have tried using
'[^/.]+$'
but it gives an error.

The regex to extract what you are asking for is:
\.([^.\n\s]*)$ with /gm modifiers
explanation:
\. matches the character . literally
1st Capturing group ([^.\n\s]*)
[^.\n\s]* match a single character not present in the list below
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
. the literal character .
\n matches a line-feed (newline) character (ASCII 10)
\s match any white space character [\r\n\t\f ]
$ assert position at end of a line
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
g modifier: global. All matches
for your input example, it will be:
import re
data = "xxx#site.co.uk\nxxx#site.uk\nxxx#site.me.uk"  # the addresses from the question
m = re.compile(r'\.([^.\n\s]*)$', re.M)
f = re.findall(m, data)
print(f)
output:
['uk', 'uk', 'uk']
hope this helps.

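As myemail#com is a valid address (the part after # may contain no dot at all), you can use:
#(?:.*\.)?([^.]+)$
The optional group (?:.*\.)? consumes everything up to the last dot when there is one, so the capture is the whole last label rather than only its final character.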

You don't need regex. This would always give you 'uk' in your examples:
>>> url = 'foo#site.co.uk'
>>> url.split('.')[-1]
'uk'

Wouldn't simply .*\.(\w+) help?
You can add more validation for "#" to the regular expression if needed.
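One way to add that "#" check, as a rough sketch rather than a full address validator: require the # before the greedy .* and anchor the capture at the end of the string. The helper name top_level is made up for this example.

```python
import re

# '#' must appear, then anything, then the last dot and a final word
tld = re.compile(r"#.*\.(\w+)$")

def top_level(address):
    """Return the text after the last dot that follows '#', or None."""
    m = tld.search(address)
    return m.group(1) if m else None

for address in ["xxx#site.co.uk", "xxx#site.uk", "xxx#site.me.uk", "myemail#com"]:
    print(address, "->", top_level(address))
```

An address with no dot after the # (the last sample) gives None with this pattern, so decide up front whether that case should count as a match.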

Python regex to extract tokens

I am trying to find all the tokens which look either like abc_rty or abc_45 or abc09_23k or abc09-K34 or 4535. The tokens shouldn't start with _ or - or numbers.
I am not making any progress and have even lost the progress that I had made. This is what I have now:
r'(?<!0-9)[(a-zA-Z)+]_(?=a-zA-Z0-9)|(?<!0-9)[(a-zA-Z)+]-(?=a-zA-Z0-9)\w+'
To make the question more clear here is an example:
If I have a string as follows:
D923-44 43 uou 08*) %%5 89ANB -iopu9 _M89 _97N hi_hello
Then it shall accept
D923-44 and 43 and uou and hi_hello
It should ignore
08*) %%5 89ANB -iopu9 _M89 _97N
I might have missed some cases, but I think the text should be enough. Apologies if it's not.

^(\d+|[A-Za-z][\w_-]*)$
Split the line on spaces, then run this regex against each token to filter.
^ is the start of the line
\d means digits [0-9]
+ means one or more
| means OR
[A-Za-z] first character must be a letter
[\w_-]* There can be any number of alphanumeric, _ or - characters after it, or nothing at all.
$ means the end of the line
The flow of the REGEX is shown in the chart I provided, which somewhat explains how it's happening.
However, I'll explain: basically it checks whether the token is all digits OR starts with a letter (upper/lower), and after that letter it accepts any alphanumeric, _ or - characters until the end of the line.
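Put together, the split-then-filter approach could look like the sketch below. It uses the sample string from the question. The variable names are illustrative.

```python
import re

token = re.compile(r"^(\d+|[A-Za-z][\w_-]*)$")
line = "D923-44 43 uou 08*) %%5 89ANB -iopu9 _M89 _97N hi_hello"

# Split on spaces, then keep only the pieces the pattern accepts in full
accepted = [piece for piece in line.split(" ") if token.match(piece)]
print(accepted)
# ['D923-44', '43', 'uou', 'hi_hello']
```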

This appears to work as desired:
import re
regex = re.compile(r"""
    (?<!\S)      # Assert there is no non-whitespace before the current character
    (?:          # Start of non-capturing group:
      [^\W\d_]   # Match either a letter
      [\w-]*     # followed by any number of the allowed characters
    |            # or
      \d+        # match a string of digits.
    )            # End of group
    (?!\S)       # Assert there is no non-whitespace after the current character""",
    re.VERBOSE)
See it on regex101.com.
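A quick way to try this without splitting first is findall on the sample string from the question. The pattern is written on one line here, without the comments, only to keep the sketch short.

```python
import re

# Same pattern as above, in compact form
regex = re.compile(r"(?<!\S)(?:[^\W\d_][\w-]*|\d+)(?!\S)")

text = "D923-44 43 uou 08*) %%5 89ANB -iopu9 _M89 _97N hi_hello"
tokens = regex.findall(text)
print(tokens)
# ['D923-44', '43', 'uou', 'hi_hello']
```

Because the two lookarounds check for whitespace (or the string boundary) on both sides, pieces such as 89ANB or _M89 are rejected as a whole instead of being matched partially.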
