How can I write a regular expression that handles the following substitution scenario?
Hello, this is a ne-
w line of text wher-
e we are trying hyp-
henation.
I have a short piece of Python code that breaks a long one-line string into a multi-line string and produces output similar to the sample above.
I want a regular expression that handles a single character left dangling by the hyphenation, as in the first and second lines, and pulls that character up onto the previous line.
something like re.sub("-\n<any character>","<the any character>\n")
I cannot find a way to handle the hyphenated character.
Below is some further information about the question.
Word = "Python string comparison is performed using the characters in both strings. The characters in both strings are compared one by one."
def hyphenate(word, x):
    for i in range(x, len(word), x):
        word = word[:i] + ("\n" if (word[i] == " " or word[i-1] == " " ) else "-\n") + (word[i:] if word[i] != " " else word[(i+1):])
    return word
print(hyphenate(Word, 20))
#Produced output
Python string compar-
ison is performed
using the character- <=
s in both strings.
The characters in b- <=
oth strings are co-
mpared one by one.
#Desired output
Python string compar-
ison is performed
using the characters <=
in both strings.
The characters in <=
both strings are co-
mpared one by one.
You don't need to include the trailing character at all.
re.sub(r'-\n', '', text)
If for some reason you do need to capture the character, you can use r'\1' to refer back to it.
re.sub(r'-\n([aeiou])', r'\1', text)
The notation r'...' produces a "raw string" where backslashes only represent themselves. In Python, backslashes in strings are otherwise processed as escapes - for example, '\n' represents the single character newline, whereas r'\n' represents the two literal characters backslash and n (which in a regex match a literal newline).
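If the goal is the desired output from the question (moving only a single stranded letter, not joining the lines), a minimal sketch could use two substitutions with a captured group. The function name pull_single_chars and the sample string are assumptions made for illustration, and the patterns only cover a lone letter that is followed or preceded by a space.

```python
import re

def pull_single_chars(text):
    # "character-\ns in" -> "characters\nin": a lone letter starts the next line.
    text = re.sub(r'-\n(\w) ', r'\1\n', text)
    # "in b-\noth" -> "in\nboth": a lone letter ends the current line.
    text = re.sub(r' (\w)-\n', r'\n\1', text)
    return text

sample = (
    "Python string compar-\n"
    "ison is performed\n"
    "using the character-\n"
    "s in both strings.\n"
    "The characters in b-\n"
    "oth strings are co-\n"
    "mpared one by one."
)
print(pull_single_chars(sample))
```

Hyphenations that split a word into longer pieces, such as "compar-" and "ison", are left untouched because neither pattern matches them.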
Related
How to find and remove all the unneeded backslash escapes in Python regular expressions.
For example, in r'\{\"*' all the escapes are unnecessary, and the pattern has the same meaning as r'{"*'. But in r'\[a-b]\{2}\Z\'\+' removing any of the escapes would change how the regex is interpreted by the regex engine (or cause a syntax error).
Given the pattern, is there an easy way to remove the unnecessary escapes programmatically in Python, i.e. something other than parsing the whole regex string and looking for escapes on non-special characters?
Here is the code that I came up with:
from contextlib import redirect_stdout
from io import StringIO
from re import compile, DEBUG, error, MULTILINE, VERBOSE
def unescape(pattern: str, flags: int):
    """Remove any escape that does not change the regex meaning"""
    # Capture the parse tree that re prints to stdout in DEBUG mode.
    strio = StringIO()
    with redirect_stdout(strio):
        compile(pattern, DEBUG | flags)
    original_debug = strio.getvalue()
    # Walk the pattern from right to left so earlier indexes stay valid.
    index = len(pattern)
    while index > 0:
        index -= 1
        character = pattern[index]
        if character != '\\':
            continue
        removed_escape = pattern[:index] + pattern[index+1:]
        strio = StringIO()
        with redirect_stdout(strio):
            try:
                compile(removed_escape, DEBUG | flags)
            except error:
                continue
        # Keep the shorter pattern only if the parse tree is unchanged.
        if original_debug == strio.getvalue():
            pattern = removed_escape
    return pattern
def print_unescaped_raw(regex: str, flags:int=0):
    """Print an unescaped raw-string representation for regex."""
    print(
        ("r'%s'" % unescape(regex, flags)
         .replace("'", r'\'')
         .replace('\n', r'\n'))
    )
print_unescaped_raw(r'\{\"*') # r'{"*'
One can also use sre_parse.parse directly, but the SubPatterns and tuples in the result may contain nested SubPatterns. And SubPattern instances don't have __eq__ method defined for them, so a recursive comparison subroutine might be required.
P.S.
Unfortunately, this method does not work with the regex module because in regex you get different debug output for escaped characters:
regex.compile(r'{', regex.DEBUG)
LITERAL MATCH '{'
regex.compile(r'\{', regex.DEBUG)
CHARACTER MATCH '{'
Unlike re that gives:
re.compile(r'{', re.DEBUG)
LITERAL 123
re.compile(r'\{', re.DEBUG)
LITERAL 123
I will not write the whole implementation, but I can give you some hints toward a viable heuristic/algorithm:
Initial hypothesis: for each regex that you are going to modify, you have a list of input strings and expected output strings that validate its behavior.
1. Use http://www.rexegg.com/regex-quickstart.html to build a list of the characters that must stay escaped with the backslash \ (the elements that should not be replaced).
2. Parse your regex and replace every \X by X, where X is a character that is not in the list built in step 1.
3. Run your initial regex and your new regex on the same input strings and compare their respective outputs for every input.
4. If all of the results are the same, you can use your new, simplified regex.
5. If at least one output is different, throw away the new regex and proceed with local replacements. Select one of the \X in your initial regex that is not in the list from step 1 (randomly, or round robin) and replace it by X. Compare the output with the initial regex's output for each input string. If it matches, keep that regex and repeat step 5 until no further progress is possible. If the output differs, remove that \X from the candidates you might be able to replace and repeat step 5 with your previous regex. When the list of possible local replacements is empty, you can use the new regex instead of the old one.
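The local-replacement loop in step 5 can be sketched roughly as follows. This is only an illustration under stated assumptions: the keep-list KEEP_ESCAPED is a guess (metacharacters plus all letters and digits) rather than the list from the linked page, the helper names are hypothetical, and "same behavior" is judged solely by the match spans on the sample strings you supply.

```python
import re
import string

# Assumed keep-list: regex metacharacters plus every letter and digit,
# since an escaped letter or digit (\d, \b, \1, ...) usually has its own meaning.
KEEP_ESCAPED = set(r"\.^$*+?{}[]()|") | set(string.ascii_letters + string.digits)

def escape_positions(pattern, keep):
    """Indexes of backslashes whose escaped character is not in keep."""
    positions = []
    i = 0
    while i < len(pattern) - 1:
        if pattern[i] == "\\":
            if pattern[i + 1] not in keep:
                positions.append(i)
            i += 2  # skip the escaped character, so "\\\\" is handled as one unit
        else:
            i += 1
    return positions

def simplify(pattern, samples, keep=KEEP_ESCAPED):
    """Drop escapes one at a time while the sample behavior stays identical."""
    def behaviour(p):
        rx = re.compile(p)
        return [[m.span() for m in rx.finditer(s)] for s in samples]

    expected = behaviour(pattern)
    # Right to left, so removing one backslash does not shift the others.
    for pos in reversed(escape_positions(pattern, keep)):
        candidate = pattern[:pos] + pattern[pos + 1:]
        try:
            if behaviour(candidate) == expected:
                pattern = candidate
        except re.error:
            pass  # the candidate no longer compiles, keep the escape
    return pattern

print(simplify(r'\"\d+\=\{', ['"12={', 'x']))
print(simplify(r'[a\-z]+', ['a-z', 'b']))
```

The second call keeps its escape because [a-z]+ matches the sample 'a-z' differently from [a\-z]+. The result is only as trustworthy as the sample strings are representative.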
NOTE: This post is not the same as the post "Re.sub not working for me".
That post is about matching and replacing ANY non-alphanumeric substring in a string.
This question is specifically about matching and replacing non-alphanumeric substrings that explicitly show up at the beginning of a string.
The following method attempts to match any non-alphanumeric character string "AT THE BEGINNING" of a string and replace it with a new string "BEGINNING_"
def m_getWebSafeString(self, dirtyAttributeName):
    cleanAttributeName = ''.join(dirtyAttributeName)
    # Deal with beginning of string...
    cleanAttributeName = re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
    # Deal with end of string...
    if "BEGINNING_" in cleanAttributeName:
        print ' ** ** ** D: "{}" ** ** ** C: "{}"'.format(dirtyAttributeName, cleanAttributeName)
PROBLEM DESCRIPTION: The method seems to not only replace non-alphanumeric characters but also incorrectly insert the "BEGINNING_" string at the beginning of every string that is passed into it. In other words...
GOOD RESULT: If the method is passed the string *##$ThisIsMyString1, it correctly returns BEGINNING_ThisIsMyString1
BAD/UNWANTED RESULT: However, if the method is passed the string ThisIsMyString2, it incorrectly (and always) inserts the replacement string (BEGINNING_), even when there are no non-alphanumeric characters, and yields the result BEGINNING_ThisIsMyString2
MY QUESTION: What is the correct way to write the re.sub() line so that it only replaces non-alphanumeric characters at the beginning of the string and does not always insert the replacement string at the beginning of the original input string?
You're matching 0 or more instances of non-alphabetic characters by using the * quantifier, which means it'll always be picked up by your pattern. You can replace what you have with
re.sub('^[^a-zA-Z]+', ...)
to ensure that only 1 or more instances are matched.
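To see the difference between the two quantifiers, here is a small sketch using the two inputs from the question (the variable names are just for illustration). With *, the pattern succeeds with an empty match at the start of a clean string, so the replacement is still inserted; with +, there is nothing to match and the string comes back unchanged.

```python
import re

dirty = '*##$ThisIsMyString1'
clean = 'ThisIsMyString2'

# * also matches the empty string at position 0, so the prefix is always added.
star_clean = re.sub(r'^[^a-zA-Z]*', 'BEGINNING_', clean)
# + needs at least one non-letter, so a clean string is left alone.
plus_clean = re.sub(r'^[^a-zA-Z]+', 'BEGINNING_', clean)
plus_dirty = re.sub(r'^[^a-zA-Z]+', 'BEGINNING_', dirty)

print(star_clean)   # BEGINNING_ThisIsMyString2
print(plus_clean)   # ThisIsMyString2
print(plus_dirty)   # BEGINNING_ThisIsMyString1
```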
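Replace
re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
with
re.sub('^[^a-zA-Z]+',"BEGINNING_",cleanAttributeName)
Note that the range A-z in the original is also a typo: it covers the punctuation characters between Z and a ([, \, ], ^, _ and `), so use A-Z.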
There is a more concise solution. You can use this:
re.sub(r'^\W+', 'BEGINNING_', cleanAttributeName)
\W matches any non-word character; for ASCII text this is equivalent to the class [^a-zA-Z0-9_]. Unlike [^a-zA-Z], it leaves leading digits and underscores in place.
>>> re.sub(r'^\W+', 'BEGINNING_', '##$ThisIsMyString1')
'BEGINNING_ThisIsMyString1'
>>> re.sub(r'^\W+', 'BEGINNING_', 'ThisIsMyString2')
'ThisIsMyString2'
I'm trying to write a regular expression in python, and one of the characters involved in it is the \001 character. putting \001 in a string doesn't seem to work. I also tried 'string' + str(chr(1)), but the regex doesn't seem to catch it. Please for the love of god somebody help me, I've been struggling with this all day.
import sys
import postgresql
import re
if len(sys.argv) != 2:
    print("usage: FixToDb <fix log file>")
else:
    f = open(sys.argv[1], 'r')
    timeExp = re.compile(r'(\d{2}):(\d{2}):(\d{2})\.(\d{6}) (\S)')
    tagExp = re.compile('(\\d+)=(\\S*)\001')
    for line in f:
        #parse the time
        m = timeExp.match(line)
        print(m.group(1) + ':' + m.group(2) + ':' + m.group(3) + '.' + m.group(4) + ' ' + m.group(5));
        tagPairs = re.findall('\\d+=\\S*\001', line)
        for t in tagPairs:
            tagPairMatch = tagExp.match(t)
            print ("tag = " + tagPairMatch.group(1) + ", value = " + tagPairMatch.group(2))
Here is an example line of the input. I replaced the '\001' character with a '~' for readability.
15:32:36.357227 R 1 0 0 0 8=FIX.4.2~9=0067~35=A~52=20120713-19:32:36~34=1~49=PD~56=P~98=0~108=30~10=134
output:
15:32:36.357227 R
tag = 8, value = FIX.4.29=006735=A52=20120713-19:32:3634=149=PD56=P98=0108=3010=134
So it doesn't stop at the '\001' character.
chr(1) should work, as will "\x01", as will "\001". (Note that chr(1) already returns a string, so you don't need to do str(chr(1)).) In your example it looks like you have both "\001" and chr(1), so that won't work unless you have two of the characters in a row in your data.
You say the regex "doesn't seem to catch it", but you don't give an example of your input data, so it's impossible to say why.
Edit: Okay, it looks like the problem has nothing to do with the \001. It is the classic greediness problem. The \S* in your tagExp expression will match a \001 character (since that character is not whitespace), so the \S* is gobbling the entire line. Use \S*? to make it non-greedy.
Edit: As others have noted, it also looks like your backslashes are awry. In regular expressions you face a backslash-doubling problem: Python uses the backslash for its own string escapes (like \t for tab, \n for newline), but regular expressions also use the backslash for their own purposes (e.g., \s for whitespace). The usual solution is to use raw strings, as you already do for timeExp. That works for your other regexes as well, because the re module itself understands \001 and \x01 inside a pattern. If you keep non-raw strings, double the backslashes (except on \001, because you want that one to be interpreted as a character-code escape).
Instead of using \S to match the value, which can be any non-whitespace character, including \001, you should use [^\x01], which will match any character that is not \001.
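A small sketch of the three variants discussed in these answers, run against a shortened, made-up FIX line (the sample data and the names SOH and line are assumptions; the real log line is longer). It compares the greedy \S*, the non-greedy \S*? and the negated class [^\x01]*.

```python
import re

SOH = "\x01"  # the same character as "\001" and chr(1)
# Shortened stand-in for the log line, with '~' swapped back to the delimiter.
line = "8=FIX.4.2~9=0067~35=A~10=134~".replace("~", SOH)

greedy = re.findall(r"(\d+)=(\S*)\x01", line)       # \S also matches \x01
lazy = re.findall(r"(\d+)=(\S*?)\x01", line)        # stops at the first \x01
negated = re.findall(r"(\d+)=([^\x01]*)\x01", line)  # the value cannot contain \x01

print(len(greedy))   # 1
print(lazy)          # [('8', 'FIX.4.2'), ('9', '0067'), ('35', 'A'), ('10', '134')]
print(negated == lazy)   # True
```

The negated class states the intent most directly (a value is anything up to the delimiter) and does not rely on backtracking.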
@Sam Mussmann: note that
1 (decimal), \001 (octal) and \x01 (hexadecimal) all denote the same character, chr(1).
I'm trying to learn python, and I'm pretty new at it, and I can't figure this one part out.
Basically, what I'm doing now is something that takes the source code of a webpage, and takes out everything that isn't words.
Webpages have a lot of \n and \t, and I want something that will find \ and delete everything between it and the next ' '.
def removebackslash(source):
    while(source.find('\') != -1):
        startback = source.find('\')
        endback = source[startback:].find(' ') + startback + 1
        source = source[0:startback] + source[endback:]
    return source
is what I have. It doesn't work like this, because the \' doesn't close the string, but when I change \ to \\, it interprets the string as \\. I can't figure out anything that is interpreted as '\'.
\ is an escape character; it either gives characters a special meaning or takes said special meaning away. Right now, it's escaping the closing single quote and treating it as a literal single quote. You need to escape it with itself to insert a literal backslash:
def removebackslash(source):
    while(source.find('\\') != -1):
        startback = source.find('\\')
        endback = source[startback:].find(' ') + startback + 1
        source = source[0:startback] + source[endback:]
    return source
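For comparison, the same removal can be sketched as a single re.sub call. This is an alternative technique, not the code from the answer; the function name remove_backslash_runs is made up for illustration. Unlike the loop, it also terminates when a backslash has no space after it.

```python
import re

def remove_backslash_runs(source):
    # Delete each backslash, everything up to the next space, and that space.
    # (?: |$) lets a run at the very end of the string be removed as well.
    return re.sub(r'\\[^ ]*(?: |$)', '', source)

print(remove_backslash_runs('one \\n two \\t\\t three'))   # one two three
```

Note that this only affects literal backslash characters. If the page source contains real newline and tab characters, there is no backslash in the string to find.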
Try using replace:
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
So in your case:
my_text = my_text.replace('\n', '')
my_text = my_text.replace('\t', '')
As others have said, you need to use '\\'. The reason you think this isn't working is because when you get the results, they look like they begin with two backslashes. But they don't begin with two backslashes, it's just that Python shows two backslashes. If it didn't, you couldn't tell the difference between a newline (represented as \n) and a backslash followed by the letter n (represented as \\n).
There are two ways to convince yourself of what's really going on. One is to use print on the result, which causes it to expand the escapes:
>>> x = "here is a backslash \\ and here comes a newline \n this is on the next line"
>>> x
u'here is a backslash \\ and here comes a newline \n this is on the next line'
>>> print x
here is a backslash \ and here comes a newline
this is on the next line
>>> startback = x.find('\\')
>>> x[startback:]
u'\\ and here comes a newline \n this is on the next line'
>>> print x[startback:]
\ and here comes a newline
this is on the next line
Another way is to use len to verify the length of the string:
>>> x = "Backslash \\ !"
>>> startback = x.find('\\')
>>> x[startback:]
u'\\ !'
>>> print x[startback:]
\ !
>>> len(x[startback:])
3
Notice that len(x[startback:]) is 3. The string contains three characters: backslash, space, and exclamation point. You can see what's going on even more simply by just looking at a string that contains only a backslash:
>>> x = "\\"
>>> x
u'\\'
>>> print x
\
>>> len(x)
1
x only looks like it starts with two backslashes when you evaluate it at the interactive prompt (or otherwise use its __repr__ method). When you actually print it, you can see it's only one backslash, and when you look at its length, you can see it's only one character long.
So what this means is you need to escape the backslash in your find, and you need to recognize that the backslashes displayed in the output may also be doubled.
The SO auto-format shows your problem. Since \ is used to escape characters, it's escaping the end quotes. Try changing that line to (note the use of double quotes):
while(source.find("\\") != -1):
Read more about escape characters in the docs.
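I don't think anyone's mentioned this yet, but if you don't want to deal with having to escape characters, you can often use a raw string.
source.find(r'\n')  # finds a backslash followed by the letter n
Adding the letter r before the string tells Python not to interpret backslash escapes, so the string stays as you typed it. One caveat: a raw string literal cannot end in a single backslash, so r'\' is a syntax error; to search for a lone backslash you still need '\\'.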
I need to be able to tell the difference between a string that can contain letters and numbers, and a string that can contain numbers, colons and hyphens.
>>> def checkString(s):
...     pattern = r'[-:0-9]'
...     if re.search(pattern,s):
...         print "Matches pattern."
...     else:
...         print "Does not match pattern."
# 3 numbers separated by colons: 12, 24 and minus 14
>>> s1 = "12:24:-14"
# String containing letters and string containing letters/numbers.
>>> s2 = "hello"
>>> s3 = "hello2"
When I run the checkString method on each of the above strings:
>>> checkString(s1)
Matches pattern.
>>> checkString(s2)
Does not match pattern.
>>> checkString(s3)
Matches pattern.
s3 is the only one that doesn't do what I want. I'd like to be able to create a regex that allows numbers, colons and hyphens, but excludes EVERYTHING else (or just alphabetical characters). Can anyone point me in the right direction?
EDIT:
Therefore, I need a regex that would accept:
229 // number
187:657 //two numbers
187:678:-765 // two pos and 1 neg numbers
and decline:
Car //characters
Car2 //characters and numbers
You need to match the whole string, not a single character as you do at the moment:
>>> re.search('^[-:0-9]+$', "12:24:-14")
<_sre.SRE_Match object at 0x01013758>
>>> re.search('^[-:0-9]+$', "hello")
>>> re.search('^[-:0-9]+$', "hello2")
To explain regex:
within square brackets (character class): match digits 0 to 9, hyphen and colon, only once.
+ is a quantifier, that indicates that preceding expression should be matched as many times as possible but at least once.
^ and $ match the start and end of the string. For one-line strings they behave like \A and \Z, except that $ also matches just before a trailing newline.
This way you restrict the content of the whole string to be at least one character long and to contain any permutation of characters from the character class. What you were doing beforehand was searching for a single character from the character class anywhere within the subject string. This is why s3, which contains a digit, matched.
SilentGhost's answer is pretty good, but take note that it would also match strings like "---::::" with no digits at all.
I think you're looking for something like this:
'^(-?\d+:)*-?\d+$'
^ Matches the beginning of the line.
(-?\d+:)* Possible - sign, at least one digit, a colon. That whole pattern 0 or many times.
-?\d+ Then the pattern again, at least once, without the colon
$ The end of the line
This will better match the strings you describe.
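As a quick check, the pattern can be run against the accept and decline examples from the question, plus the "---::::" case mentioned above. The name NUMBER_LIST and the results dictionary are only for illustration.

```python
import re

NUMBER_LIST = re.compile(r'^(-?\d+:)*-?\d+$')

candidates = ["229", "187:657", "187:678:-765", "Car", "Car2", "---::::"]
results = {s: bool(NUMBER_LIST.match(s)) for s in candidates}

for s, ok in results.items():
    print(s, ok)
```

The first three strings are accepted and the last three are rejected, including the punctuation-only string that a plain character class would let through.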
pattern = r'\A([^-:0-9]+|[A-Za-z0-9])\Z'
Your regular expression is almost fine; you just need to make it match the whole string. Also, as a commenter pointed out, you don't really need a raw string (the r prefix on the string) in this case. Voila:
def checkString(s):
    if re.match('[-:0-9]+$', s):
        print("Matches pattern.")
    else:
        print("Does not match pattern.")
The '+' means "match one or more of the previous expression". (This will make checkString report no match on an empty string. If you want an empty string to match, change the '+' to a '*'.) The '$' means "match the end of the string".
re.match means "the string must match the regular expression starting at the first character"; re.search means "the regular expression can match a sequence anywhere inside the string".
Also, if you like premature optimization--and who doesn't!--note that re.match has to look up the regular expression on every call (the re module caches compiled patterns, so it is only compiled on first use and the saving is small). This version compiles the regular expression explicitly, once:
# Compiled once at import time; reading a module-level name needs no global statement.
__checkString_re = re.compile('[-:0-9]+$')
def checkString(s):
    if __checkString_re.match(s):
        print("Matches pattern.")
    else:
        print("Does not match pattern.")
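The difference between re.match and re.search described above can be seen with the same pattern on "hello2". This is a small sketch; the names PATTERN and is_number_list are made up, and the function returns a boolean instead of printing so the result is easy to reuse.

```python
import re

PATTERN = re.compile(r'[-:0-9]+$')

def is_number_list(s):
    # match() anchors at the first character; together with the trailing $
    # the whole string must consist of digits, colons and hyphens.
    return PATTERN.match(s) is not None

print(is_number_list("12:24:-14"))        # True
print(is_number_list("hello2"))           # False
# search() may start anywhere, so it still finds the trailing digit.
print(PATTERN.search("hello2").group())   # 2
```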