I have seen questions similar to this, yet none that address this particular issue. I have a calculator expression using +, -, *, or / operators, and I want to standardize it so that anything someone enters will be homogenous with how my program wants it...
My program wants a string of the format " 10 - 7 * 5 / 2 + 3 ", with the spaces before and after, and in-between each value. I want to take anything someone enters such as "10-7*5/2+3" or " 10- 7*5/2 + 3 ", and make it into the first format I specified.
My first idea was to convert the string to a list, then join with spaces in-between and concatenate the spaces on the front and end, but the clear problem with that is that the '10' gets split into '1' and '0' and comes out as '1 0' after joining.
s = s.replace(" ", "")
if s[0] == "-":
    s = "0" + s
else:
    s = s
s = " " + " ".join(list(s)) + " "
I was thinking maybe doing something with RegEx might help, but I'm not entirely sure how to put that into action. The main slip-up for me mentally is getting the '10' and other multi-digit numbers not to split apart into their constituent digits when I do this.
I'm using Python 3.5.
Solution
Here is one idea if you're only ever dealing with very simple calculator expressions (i.e. digits and operators). If you also have other possible elements, you'd just have to adjust the regex.
Use a regex to extract the relevant pieces, ignoring whitespace, and then recompose them using a join.
import re

def compose(expr):
    elems = re.findall(r'(\d+|[+\-*/])', expr) # a group consists of a digit sequence OR an operator
    return ' ' + ' '.join(elems) + ' ' # puts a single space between all groups and one before and after
compose('10- 7*5/2 + 3')
# ' 10 - 7 * 5 / 2 + 3 '
compose('10-7*5/2+3')
# ' 10 - 7 * 5 / 2 + 3 '
Detailed Regex Explanation
The meat of the re.findall call is the regular expression: r'(\d+|[+\-*/])'
The first bit: \d means match one digit. + means match one or more of the preceding expression. So together \d+ means match one or more digits in a row.
The second bit: [...] is the character-set notation. It means match any one of the characters in the set. Inside a set, + and * lose their special meaning and need no escape; - is escaped because an unescaped hyphen between two characters denotes a range. Forward slash is not special, so it does not require an escape either. Items in a set are not separated by commas: a comma inside the brackets would itself be matched as a literal character. So [+\-*/] means match any one of +, -, *, /.
The | in between the two regexes is your standard OR operator. So match either the first expression OR the second one. And parentheses are group notation in regexes, indicating which part of the regex you actually want to be returned.
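The question's own snippet also prefixes a leading minus with 0. As a rough sketch, that step could be combined with the findall approach like this; normalize is a made-up name, and the code assumes the input only contains non-negative integers and the four operators.

```python
import re

def normalize(expr):
    """Return expr in the ' 10 - 7 * 5 / 2 + 3 ' layout (hypothetical helper)."""
    expr = expr.replace(" ", "")
    # Same trick as the question's snippet: treat a leading minus as "0 - ...".
    if expr.startswith("-"):
        expr = "0" + expr
    # Digit runs stay together; each operator becomes its own token.
    tokens = re.findall(r"\d+|[+\-*/]", expr)
    return " " + " ".join(tokens) + " "

print(repr(normalize("-3+10*2")))  # ' 0 - 3 + 10 * 2 '
```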
I'd suggest taking a simple and easy approach; remove all spaces and then go through the string character by character, adding spaces before and after each operator symbol.
Anything with two operators in a row is going to be invalid syntax anyway, so you can leave that to your existing calculator code to throw errors on.
sanitised_string = ""
for char in unformatted_string_without_spaces:
    if char in some_list_of_operators_you_made:
        sanitised_string += " " + char + " "
    else:
        sanitised_string += char
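For reference, a self-contained version of that loop might look like the sketch below. The names OPERATORS and pad_operators are placeholders, the operator set is assumed to be just + - * /, and the outer spaces are added only because the question asks for them.

```python
OPERATORS = "+-*/"  # assumed operator set

def pad_operators(raw):
    """Strip all spaces, then put one space on each side of every operator."""
    out = []
    for char in raw.replace(" ", ""):
        if char in OPERATORS:
            out.append(" " + char + " ")
        else:
            out.append(char)
    # Leading and trailing spaces match the format the question asks for.
    return " " + "".join(out) + " "

print(repr(pad_operators(" 10- 7*5/2 + 3 ")))  # ' 10 - 7 * 5 / 2 + 3 '
```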
Just like @fukanchik suggested, this is usually done in reverse, as in breaking the input string down into its basic components, and then re-assembling it again as you like.
I'd say you are on the right track using RegEx, as it's perfect for parsing this kind of input (perfect as in you don't need to write a more advanced parser). For this, just define all your symbols as little regexes:
lexeme_regexes = [r"\+", "-", r"\*", "/", r"\d+"]
and then assemble a big regex that you can use for "walking" your input string:
import re
regex = re.compile("|".join(lexeme_regexes))
lexemes = regex.findall("10 - 7 * 5 / 2 + 3")
To get to your normalized form, just assemble it again:
normalized = " ".join(lexemes)
This example doesn't ensure that all operators are seamlessly split by whitespace though; that'll need some more effort.
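One piece of that extra effort could be checking that the lexemes account for the whole input, since findall silently skips anything it cannot match. A minimal sketch, assuming integers and the four operators are the only valid tokens (LEXEME and normalize_strict are illustrative names):

```python
import re

LEXEME = re.compile(r"\d+|[+\-*/]")  # assumed token set: integers and + - * /

def normalize_strict(expr):
    lexemes = LEXEME.findall(expr)
    # Compare the recovered tokens with the input minus whitespace
    # to catch characters that findall skipped over.
    if "".join(lexemes) != "".join(expr.split()):
        raise ValueError("unexpected character in expression: %r" % expr)
    return " ".join(lexemes)

print(normalize_strict("10-7 *5"))  # 10 - 7 * 5
```

Calling normalize_strict("10 $ 3") raises ValueError instead of quietly returning "10 3".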
Related
How can I write a regular expression that handles the following substitution scenario?
Hello, this is a ne-
w line of text wher-
e we are trying hyp-
henation.
I have a short piece of Python code that breaks a long one-line string into a multi-line string and produces output similar to the sample above.
I want a regular expression that takes care of a single stranded character, as in the first and second lines, and pulls it up onto the previous line.
Something like re.sub("-\n<any character>","<the any character>\n")
I cannot find a way to handle the hyphenated character.
Below is some further information about the question:
Word = "Python string comparison is performed using the characters in both strings. The characters in both strings are compared one by one."
def hyphenate(word, x):
    for i in range(x, len(word), x):
        word = word[:i] + ("\n" if (word[i] == " " or word[i-1] == " " ) else "-\n") + (word[i:] if word[i] != " " else word[(i+1):])
    return(word)
print(hyphenate(Word, 20))
#Produced output
Python string compar-
ison is performed
using the character- <=
s in both strings.
The characters in b- <=
oth strings are co-
mpared one by one.
#Desired output
Python string compar-
ison is performed
using the characters <=
in both strings.
The characters in <=
both strings are co-
mpared one by one.
You don't need to include the trailing character at all.
re.sub(r'-\n', '', text)  # text is the string to process
If for some reason you do need to capture the character, you can use r'\1' to refer back to it.
re.sub(r'-\n([aeiou])', r'\1', text)
The notation r'...' produces a "raw string" where backslashes only represent themselves. In Python, backslashes in strings are otherwise processed as escapes - for example, '\n' represents the single character newline, whereas r'\n' represents the two literal characters backslash and n (which in a regex match a literal newline).
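To get the desired output from the question (lines stay broken, only the stranded single letters move), something along these lines might work. This is a sketch, not a general hyphenation fix: fix_stranded_letters is a made-up name, and it assumes a stranded letter is a single word character standing alone on one side of the break.

```python
import re

def fix_stranded_letters(text):
    # "character-\ns in" -> "characters\nin": a lone letter after the break
    # moves up, and the space that followed it is dropped.
    text = re.sub(r"-\n(\w)(?!\w) ?", r"\1\n", text)
    # "in b-\noth" -> "in\nboth": a lone letter before the hyphen moves down.
    text = re.sub(r" (\w)-\n", r"\n\1", text)
    return text

produced = ("Python string compar-\nison is performed\nusing the character-\n"
            "s in both strings.\nThe characters in b-\noth strings are co-\n"
            "mpared one by one.")
print(fix_stranded_letters(produced))
```

Breaks in the middle of a word, such as compar-/ison, are left alone because more than one letter sits on each side of the hyphen.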
In my Python code, I used re.compile() to check whether a given word exists.
PATTERNS = {
    re.compile(r'[\w\s] + total+ [\w\s] + cases'): data.get_total_cases,
    re.compile(r'[\w\s] + total cases'): data.get_total_cases,
    re.compile(r'[\w\s] + total + [\w\s] + deaths'): data.get_total_deaths,
    re.compile(r'[\w\s] + total deaths'): data.get_total_deaths
}
This did not work as expected, and I couldn't find anything wrong. Finally, I removed the space after every character set [\w\s], because it was the only visible difference between my code and the original code that I had referenced.
PATTERNS = {
    re.compile(r'[\w\s]+ total+ [\w\s]+ cases'): data.get_total_cases,
    re.compile(r'[\w\s]+ total cases'): data.get_total_cases,
    re.compile(r'[\w\s]+ total+ [\w\s]+ deaths'): data.get_total_deaths,
    re.compile(r'[\w\s]+ total deaths'): data.get_total_deaths
}
Now the code works and all patterns are identified successfully. But I still can't see why these spaces caused the issue.
The + symbol in a regular expression means "one or more of the preceding item".
So " +" (a space followed by +) means "one or more spaces", while the [\w\s] in front of it then matches exactly one character. [\w\s]+ means "one or more alphanumeric or whitespace characters".
If you want to match a pattern like 10 total + 10 cases with the + as a literal, you need to escape the + sign. A raw string (r before the string) allows for literal backslashes in the string, which can be used to escape characters in the regex pattern.
re.compile(r"[\w\s]+ total \+ [\w\s]+ cases")
Notice the \+ which means "literally a + sign" rather than "one or more of".
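A quick way to see the difference is to run the patterns against a couple of made-up sentences (the sample strings below are not from the question):

```python
import re

# One character, then one or more spaces, then " total cases":
spaced = re.compile(r'[\w\s] + total cases')
# One or more word/whitespace characters, then " total cases":
fixed = re.compile(r'[\w\s]+ total cases')
# Escaped plus: matches a literal "+" between the words.
literal_plus = re.compile(r'[\w\s]+ total \+ [\w\s]+ cases')

print(bool(spaced.search("how many total cases")))         # False
print(bool(spaced.search("how many  total cases")))        # True (two spaces)
print(bool(fixed.search("how many total cases")))          # True
print(bool(literal_plus.search("10 total + 10 cases")))    # True
print(bool(literal_plus.search("10 total 10 cases")))      # False
```

The spaced pattern only matches when at least two spaces precede "total", which ordinary input never has.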
I want to replace runs of more than one whitespace character in a string with "#".
A single space should stay intact, but if there are several consecutive spaces, the first one is kept and each of the others is replaced with #. For example
s = "Hello how  are   you."
would become
"Hello how #are ##you"
Python 2.7:
import re
s = "Hello how  are   you"
s = re.sub("(?<= ) ", "#", s)
print s
Or, if you want to have only one # signifying "multiple spaces", change the call to re.sub("(?<= ) +", "#", s)
Explanation: The regex contains a positive lookbehind (?<= ) : it only finds spaces that are preceded by another space, but does not include the first space in the results. Because of that, when the results are replaced by an #, the first space remains intact (it is not preceded by another one), all the others are replaced in a one-by-one fashion by #.
Adding + to the main expression means that it will collect all multiple spaces except for the first one (due to positive lookbehind) and replace them with a single #.
This pattern will only match the space character " "; if you want to cover tabs too, you'd need to change it to \s.
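For anyone on Python 3, the same substitutions can be tried like this. The sample string is a reconstruction with two spaces after "how" and three after "are", and the tab example is invented for illustration.

```python
import re

s = "Hello how  are   you"  # two spaces after "how", three after "are"

# Every space that follows another space becomes "#".
per_space = re.sub(r"(?<= ) ", "#", s)
# A whole run of extra spaces collapses into a single "#".
per_run = re.sub(r"(?<= ) +", "#", s)
# \s variant: tabs count as whitespace too.
any_ws = re.sub(r"(?<=\s)\s", "#", "a \t b")

print(per_space)  # Hello how #are ##you
print(per_run)    # Hello how #are #you
print(any_ws)     # a ##b
```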
Edit: Having looked at the source of your question, my answer didn't fix it. I couldn't see the extra spaces in your source string. A regular expression as sg.sysel suggests will do the job nicely.
In case you did want to do it yourself with a loop:
def addats(s):
    i = 0
    r = ''
    for c in s:
        if c == ' ':
            if i > 0:
                r += '#'
            else:
                r += ' '
            i += 1
        else:
            r += c
            i = 0
    return r
Note that for a real application, you should use something mutable like a list instead of a string for r there, but this should solve your immediate problem.
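A list-based variant of the same loop could look roughly like this (addats_list is just an illustrative name); the pieces are collected in a list and joined once at the end.

```python
def addats_list(s):
    parts = []   # collected output pieces, joined once at the end
    run = 0      # number of consecutive spaces seen so far
    for c in s:
        if c == ' ':
            # First space of a run is kept, later ones become '#'.
            parts.append('#' if run > 0 else ' ')
            run += 1
        else:
            parts.append(c)
            run = 0
    return ''.join(parts)

print(addats_list("Hello how  are   you"))  # Hello how #are ##you
```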
This is my first post and I am a newbie to Python. I am trying to get this to work.
string 1 = [1/0/1, 1/0/2]
string 2 = [1/1, 1/2]
I'm trying to check each string: if it contains two /, I just need to replace the 0 with 1 so it becomes 1/1/1 and 1/1/2.
If it doesn't contain two /, I need to add one along with a 1, giving the format 1/1/1 and 1/1/2, so string 2 becomes [1/1/1, 1/1/2].
The ultimate goal is to get all strings to match the pattern x/1/x. Thanks for all the input on this. I tried this and it seems to work:
for a in Port:
    if re.search(r'././', a):
        z.append(a.replace('/0/','/1/') )
    else:
        t1= a.split('/')
        if len(t1)>1 :
            t2= t1[0] + "/1/" + t1[1]
            z.append(t2)
A few more lines are there to take care of some exceptions, but it seems to do the job.
The regex pattern for identifying a / is just \/
This could be solved rather simply using the built in string functions without having to add all of the overhead and additional computational time caused by using the RegEx engine.
For example:
# The string to test:
sTest = '1/0/2'
# Test the string:
if(sTest.count('/') == 2):
    # There are two forward slashes in the string
    # If the middle number is a 0, we'll replace it with a one:
    sTest = sTest.replace('/0/', '/1/')
elif(sTest.count('/') == 1):
    # One forward slash in string
    # Insert a 1 between first portion and the last portion:
    sTest = sTest.replace('/', '/1/')
else:
    print('Error: Test string is of an unknown format.')
# End If
If you really want to use RegEx, though, you could simply match the string against these two patterns: \d+/0/\d+ and \d+/\d+(?!/). If matching against the first pattern fails, then attempt to match against the second pattern. Then, you can use either grouping, splitting, or simply calling .replace() (like I'm doing above) to format the string as you need.
EDIT: for clarification, I'll explain the two patterns:
Pattern 1: \d+/0/\d+ could essentially be read as "match any number (consisting of one (1) or more digits) followed by a forward slash, a zero (0), another forward slash and then followed by any number (consisting of one (1) or more digits)."
Pattern 2: \d+/\d+(?!/) could be read as "match any number (consisting of one (1) or more digits) followed by a forward slash and any other number (consisting of one (1) or more digits) which is then NOT followed by another forward slash." The last part in this pattern could be a little confusing because it uses the negative lookahead abilities of the RegEx engine.
If you wanted to add stricter rules to these patterns to make sure there are not any leading or trailing non-digit characters, you could add ^ to the start of the patterns and $ to the end, to signify the start of the string and the end of the string respectively. This would also allow you to remove the lookahead expression from the second pattern ((?!/)). As such, you would end up with the following patterns: ^\d+/0/\d+$ and ^\d+/\d+$.
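Put together, the anchored patterns could be used along these lines. This is only a sketch: normalize_port is a hypothetical helper, and leaving unrecognised strings unchanged is an assumption rather than something the question specifies.

```python
import re

def normalize_port(port):
    """Bring 'x/0/y' or 'x/y' into the 'x/1/y' form (hypothetical helper)."""
    if re.match(r'^\d+/0/\d+$', port):
        # Three parts with a zero in the middle: swap the 0 for a 1.
        return port.replace('/0/', '/1/')
    if re.match(r'^\d+/\d+$', port):
        # Two parts: insert a 1 between them.
        return port.replace('/', '/1/')
    return port  # assumption: anything else is passed through untouched

ports = ['1/0/1', '1/0/2', '1/1', '1/2']
print([normalize_port(p) for p in ports])  # ['1/1/1', '1/1/2', '1/1/1', '1/1/2']
```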
https://regex101.com/r/rE6oN2/1
Click code generator on the left side. You get:
import re
p = re.compile(r'\d/1/\d')
test_str = "1/1/2"
re.search(p, test_str)
I'm trying to write a regular expression in Python, and one of the characters involved in it is the \001 character. Putting \001 in a string doesn't seem to work. I also tried 'string' + str(chr(1)), but the regex doesn't seem to catch it. Please for the love of god somebody help me, I've been struggling with this all day.
import sys
import postgresql
import re
if len(sys.argv) != 2:
    print("usage: FixToDb <fix log file>")
else:
    f = open(sys.argv[1], 'r')
    timeExp = re.compile(r'(\d{2}):(\d{2}):(\d{2})\.(\d{6}) (\S)')
    tagExp = re.compile('(\\d+)=(\\S*)\001')
    for line in f:
        #parse the time
        m = timeExp.match(line)
        print(m.group(1) + ':' + m.group(2) + ':' + m.group(3) + '.' + m.group(4) + ' ' + m.group(5));
        tagPairs = re.findall('\\d+=\\S*\001', line)
        for t in tagPairs:
            tagPairMatch = tagExp.match(t)
            print ("tag = " + tagPairMatch.group(1) + ", value = " + tagPairMatch.group(2))
Here is an example line of the input. I replaced the '\001' character with a '~' for readability.
15:32:36.357227 R 1 0 0 0 8=FIX.4.2~9=0067~35=A~52=20120713-19:32:36~34=1~49=PD~56=P~98=0~108=30~10=134
output:
15:32:36.357227 R
tag = 8, value = FIX.4.29=006735=A52=20120713-19:32:3634=149=PD56=P98=0108=3010=134
So it doesn't stop at the '\001' character.
chr(1) should work, as will "\x01", as will "\001". (Note that chr(1) already returns a string, so you don't need to do str(chr(1)).) In your example it looks like you have both "\001" and chr(1), so that won't work unless you have two of the characters in a row in your data.
You say the regex "doesn't seem to catch it", but you don't give an example of your input data, so it's impossible to say why.
Edit: Okay, it looks like the problem has nothing to do with the \001. It is the classic greediness problem. The \S* in your tagExp expression will match a \001 character (since that character is not whitespace). So the \S* is gobbling the entire line. Use \S*? to make it non-greedy.
Edit: As others have noted, it also looks like your backslashes are awry. In regular expressions you face a backslash-doubling problem: Python uses the backslash for its own string escapes (like \t for tab, \n for newline), but regular expressions also use the backslash for their own purposes (e.g., \s for whitespace). The usual solution is to use raw strings. In a raw string Python itself no longer turns "\001" into the control character, but the re module understands octal and hex escapes such as \001 and \x01 inside a pattern, so a raw-string regex still matches it. If you keep non-raw strings for your regexes, double the backslashes (except on \001, because you want that one to be interpreted as a character-code escape).
Instead of using \S to match the value, which can be any non-whitespace character, including \001, you should use [^\x01], which will match any character that is not \001.
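To make the difference concrete, here is a small sketch on a shortened, made-up line in the same tag=value layout as the question's sample, with the real \x01 separator in place of "~":

```python
import re

# Three spellings of the same character.
assert chr(1) == "\x01" == "\001"

line = "8=FIX.4.2\x019=0067\x0135=A\x01"  # shortened sample, not real log data

# Greedy \S* swallows the separators and stops only at the last one.
greedy = re.findall('(\\d+)=(\\S*)\001', line)
# Non-greedy \S*? stops at the first separator after each value.
lazy = re.findall('(\\d+)=(\\S*?)\001', line)
# Negated class: the value may contain anything except the separator.
negated = re.findall(r'(\d+)=([^\x01]*)\x01', line)

print(len(greedy))  # 1
print(lazy)         # [('8', 'FIX.4.2'), ('9', '0067'), ('35', 'A')]
print(negated)      # [('8', 'FIX.4.2'), ('9', '0067'), ('35', 'A')]
```

The last pattern is a raw string; as far as I know the re module interprets \x01 itself, so the escape still works there.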
@Sam Mussmann, no...
1 (decimal) = \001 (octal) <> \x01 (UNICODE)