Replace more than one whitespace with some value in Python

I want to replace runs of more than one whitespace character in a string with "#".
A single space should stay intact, but where there are several consecutive spaces, the first one is kept and each additional one becomes a #. For example
s = "Hello how  are   you"
would become
"Hello how #are ##you"

Python 2.7:
import re
s = "Hello how  are   you"
s = re.sub("(?<= ) ", "#", s)
print(s)
Or, if you want a single # to signify "multiple spaces", change the call to re.sub("(?<= ) +", "#", s)
Explanation: The regex contains a positive lookbehind (?<= ) : it only finds spaces that are preceded by another space, but does not include the first space in the results. Because of that, when the results are replaced by an #, the first space remains intact (it is not preceded by another one), all the others are replaced in a one-by-one fashion by #.
Adding + to the main expression means that it will collect all multiple spaces except for the first one (due to positive lookbehind) and replace them with a single #.
This pattern only matches the literal space character " ". If you want to cover tabs too, you'd need to change the spaces in the pattern to \s
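The three variants described above can be put side by side in a small sketch. The function name mark_extra_spaces and its two flags are made up for illustration; only re.sub and the lookbehind pattern come from the answer.

```python
import re

def mark_extra_spaces(text, collapse=False, any_whitespace=False):
    """Replace every whitespace character after the first in a run with '#'.

    collapse=True       -> one '#' per run instead of one per extra character
    any_whitespace=True -> use \\s (tabs etc.) instead of the literal space
    """
    ws = r"\s" if any_whitespace else " "
    # (?<=ws) requires a preceding whitespace character without consuming it
    pattern = "(?<=%s)%s%s" % (ws, ws, "+" if collapse else "")
    return re.sub(pattern, "#", text)

print(mark_extra_spaces("Hello how  are   you"))                 # Hello how #are ##you
print(mark_extra_spaces("Hello how  are   you", collapse=True))  # Hello how #are #you
```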

Edit: Having looked at the source of your question, my answer didn't fix it. I couldn't see the extra spaces in your source string. A regular expression as sg.sysel suggests will do the job nicely.
In case you did want to do it yourself with a loop:
def addats(s):
    i = 0
    r = ''
    for c in s:
        if c == ' ':
            if i > 0:
                r += '#'
            else:
                r += ' '
            i += 1
        else:
            r += c
            i = 0
    return r
Note that for a real application, you should use something mutable like a list instead of a string for r there, but this should solve your immediate problem.
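As a rough sketch of what that note suggests, the same loop can collect the pieces in a list and join them once at the end. The name addats_joined is invented here; the logic is meant to mirror addats.

```python
def addats_joined(s):
    pieces = []   # mutable buffer instead of repeated string concatenation
    run = 0       # number of consecutive spaces seen so far
    for c in s:
        if c == ' ':
            # first space of a run is kept, every later one becomes '#'
            pieces.append('#' if run > 0 else ' ')
            run += 1
        else:
            pieces.append(c)
            run = 0
    return ''.join(pieces)

print(addats_joined("Hello how  are   you"))  # Hello how #are ##you
```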

Related

Delete first 3 characters of string in Python

I'm trying to delete some leading characters from a string in Python 2.7. To be more specific, the string is an MX record that looks like 10 aspmx2.googlemail.com. I need to delete the leading number (which can be one or two digits) and the space character.
Here is the code I've come up with thus far, but I'm stuck
mx_name = "10 aspmx2.googlemail.com"
for i in range(0,3):
    char = mx_name[i]
    if char == "0123456789 ":
        short_mx_name.replace(char, "")
For some reason, the if statement is not working correctly and I fail to see why. Any help would be much appreciated.
Thank you.
You can use re.sub:
import re
mx_name = "10 aspmx2.googlemail.com"
new_name = re.sub(r"^\d+\s", '', mx_name)
Output:
'aspmx2.googlemail.com'
Regex explanation:
^: anchors the expression, forcing the match to start at the beginning of the string.
\d+: matches one or more digits, stopping at the first non-digit character (in this case the space).
\s: matches a single whitespace character; it must be included here so that the substitution also removes the space between the number and the host name.
In short, ^\d+\s starts at the beginning of the string, matches the leading digits, and then matches the space that follows, so no part of the host name is removed.
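Wrapped in a function, the substitution might look like the sketch below. The name strip_priority is hypothetical, and \s+ (rather than \s) is an assumption so that several spaces after the number are also removed.

```python
import re

def strip_priority(mx_record):
    """Remove a leading run of digits plus the whitespace after it."""
    # ^ anchors at the start, so digits inside the host name are never touched
    return re.sub(r"^\d+\s+", "", mx_record)

print(strip_priority("10 aspmx2.googlemail.com"))  # aspmx2.googlemail.com
print(strip_priority("5 aspmx2.googlemail.com"))   # aspmx2.googlemail.com
```

A record without a leading number is returned unchanged, because the pattern simply does not match.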
mx_name.split()[1]
Output:
'aspmx2.googlemail.com'
Using split function
mx_name = "10 aspmx2.googlemail.com"
mx_name_url = mx_name.strip().split(' ')[1]
# aspmx2.googlemail.com
Using slicing (this only works when the number has exactly two digits)
mx_name = "10 aspmx2.googlemail.com"
mx_name[3:]
# aspmx2.googlemail.com
You can use regex :
import re
pattern=r'\b[\d\s]{1,3}\b'
string='10 aspmx2.googlemail.com'
new_string=re.sub(pattern,"",string)
print(new_string)
output:
aspmx2.googlemail.com
with single digit:
string='1 aspmx2.googlemail.com' then output:
aspmx2.googlemail.com
You can use a regex for that. There are plenty of regex answers to this question, but if the text after the first space is an email address rather than a host name, a more general pattern is:
import re
m = "10 name@email.com"
match = re.search(r'(?:\s)(\w.*@.*)', m)
match.group(1)
'name@email.com'
This pattern matches an email address after the first space. Note that it requires an @, so re.search returns None for the MX record in the question, which contains none.
(?:\s) - non-capturing whitespace character
(\w.*@.*) - captures a word character (alphanumeric or underscore), followed by anything up to an @ and everything after it
This will match the address in 4123 name@email.com or some_text name@email.com etc.
The minimum modification to your code would be this:
mx_name = "10 aspmx2.googlemail.com"
short_name = mx_name[:]
for i in range(0,3):
    char = mx_name[i]
    if char in "0123456789 ":
        short_name = short_name.replace(char, "", 1)
Your if was checking whether the char WAS the whole string "0123456789 ", not whether it was included in that set. Also, passing 1 as the count to replace is needed to avoid deleting digits and spaces further along in the string.
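If you'd rather not inspect characters one by one, a non-regex sketch using str.partition is shown below. The function name drop_leading_number is made up, and it assumes the number and host name are separated by a space.

```python
def drop_leading_number(record):
    # partition splits at the first space only: (head, separator, rest)
    head, sep, rest = record.partition(" ")
    if sep and head.isdigit():
        return rest.lstrip(" ")
    # no space found, or the first field is not a number: leave as is
    return record

print(drop_leading_number("10 aspmx2.googlemail.com"))  # aspmx2.googlemail.com
```

Unlike a character-by-character check over the first three positions, this leaves a host name that itself starts with a digit intact.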

Python 3.5: Formatting a String with Spaces

I have seen questions similar to this, yet none that address this particular issue. I have a calculator expression using +, -, *, or / operators, and I want to standardize it so that anything someone enters is consistent with the format my program expects...
My program wants a string of the format " 10 - 7 * 5 / 2 + 3 ", with the spaces before and after, and in-between each value. I want to take anything someone enters such as "10-7*5/2+3" or " 10- 7*5/2 + 3 ", and make it into the first format I specified.
My first idea was to convert the string to a list, then join with spaces in-between and concatenate the spaces on the front and end, but the clear problem with that is that the '10' gets split into '1' and '0' and comes out as '1 0' after joining.
s = s.replace(" ", "")
if s[0] == "-":
    s = "0" + s
else:
    s = s
s = " " + " ".join(list(s)) + " "
I was thinking that doing something with RegEx might help, but I'm not entirely sure how to put that into action. The main slip-up for me mentally is getting the '10' and other multi-digit numbers not to split apart into their constituents when I do this.
I'm in python 3.5.
Solution
Here is one idea if you're only ever dealing with very simple calculator expressions (i.e. digits and operators). If you also have other possible elements, you'd just have to adjust the regex.
Use a regex to extract the relevant pieces, ignoring whitespace, and then re-compose them together using a join.
import re
def compose(expr):
    # a token is either a digit sequence OR a single operator
    elems = re.findall(r'(\d+|[-+*/])', expr)
    # put a single space between all tokens and one before and after
    return ' ' + ' '.join(elems) + ' '
compose('10- 7*5/2 + 3')
# ' 10 - 7 * 5 / 2 + 3 '
compose('10-7*5/2+3')
# ' 10 - 7 * 5 / 2 + 3 '
Detailed Regex Explanation
The meat of the re.findall call is the regular expression: r'(\d+|[-+*/])'
The first bit: \d means match one digit. + means match one or more of the preceding expression. So together \d+ means match one or more digits in a row.
The second bit: [...] is the character-set notation. It means match any one of the characters in the set. Inside a set, + and * lose their special meaning and need no escape; - is only special between two characters (where it denotes a range), so placing it first keeps it literal. Commas are not separators inside a set and would be matched literally, so they are left out. So [-+*/] means match one of +, -, *, /.
The | between the two sub-expressions is the standard OR operator: match either the first expression OR the second one. The parentheses are group notation, indicating which part of the match you actually want returned.
I'd suggest taking a simple and easy approach; remove all spaces and then go through the string character by character, adding spaces before and after each operator symbol.
Anything with two operators in a row is going to be invalid syntax anyway, so you can leave that to your existing calculator code to throw errors on.
sanitised_string = ""
for char in unformatted_string_without_spaces:
    if char in some_list_of_operators_you_made:
        sanitised_string += " " + char + " "
    else:
        sanitised_string += char
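Filled in with concrete names, that loop could look like the following sketch. OPERATORS and space_operators are names chosen here, and the outer spaces are added only because the question asks for them.

```python
OPERATORS = "+-*/"  # assumed operator set, taken from the question

def space_operators(expr):
    compact = expr.replace(" ", "")   # drop whatever spacing the user typed
    parts = []
    for char in compact:
        if char in OPERATORS:
            parts.append(" " + char + " ")
        else:
            parts.append(char)        # digits stay glued together
    return " " + "".join(parts) + " "

print(repr(space_operators("10-7*5/2+3")))       # ' 10 - 7 * 5 / 2 + 3 '
print(repr(space_operators(" 10- 7*5/2 + 3 ")))  # ' 10 - 7 * 5 / 2 + 3 '
```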
Just like @fukanchik suggested, this is usually done in reverse, as in breaking the input string down into its basic components, and then re-assembling it again as you like.
I'd say you are on the right track using RegEx, as it's perfect for parsing this kind of input (perfect as in you don't need to write a more advanced parser). For this, just define all your symbols as little regexes:
import re
lexeme_regexes = [r"\+", "-", r"\*", "/", r"\d+"]
and then assemble a big regex that you can use for "walking" your input string:
regex = re.compile("|".join(lexeme_regexes))
lexemes = regex.findall("10 - 7 * 5 / 2 + 3")
To get to your normalized form, just assemble it again:
normalized = " ".join(lexemes)
This example doesn't ensure that all operators are seamlessly separated by whitespace though; that'll need some more effort.
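One possible way to add a bit of that extra effort is to check that the lexemes account for every non-space character of the input, so stray symbols are reported rather than silently dropped. This is only a sketch; normalize and TOKEN are names invented here, and raising ValueError is an arbitrary choice.

```python
import re

TOKEN = re.compile(r"\d+|[-+*/]")

def normalize(expr):
    compact = "".join(expr.split())      # remove all whitespace
    tokens = TOKEN.findall(compact)
    # findall skips anything it cannot match, so compare against the input
    if "".join(tokens) != compact:
        raise ValueError("unexpected character in expression: %r" % expr)
    return " " + " ".join(tokens) + " "

print(repr(normalize("10- 7*5/2 + 3")))  # ' 10 - 7 * 5 / 2 + 3 '
```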

insert char with regular expression

I have a string '(abc)def(abc)' and I would like to turn it into '(a|b|c)def(a|b|c)'. I can do that by:
word = '(abc)def(abc)'
pattern = ''
index = 0
while index < len(word):
    if word[index] == '(':
        pattern += word[index]
        index += 1
        while word[index+1] != ')':
            pattern += word[index]+'|'
            index += 1
        pattern += word[index]
    else:
        pattern += word[index]
    index += 1
print(pattern)
But I want to use regular expression to make it shorter. Can you show me how to insert char '|' between only characters that are inside the parentheses by regular expression?
How about
>>> import re
>>> re.sub(r'(?<=[a-zA-Z])(?=[a-zA-Z-][^)(]*\))', '|', '(abc)def(abc)')
'(a|b|c)def(a|b|c)'
(?<=[a-zA-Z]) Positive lookbehind. Ensures that the position to insert at is preceded by a letter.
(?=[a-zA-Z-][^)(]*\)) Positive lookahead. Ensures that the position is followed by a letter (or a hyphen)
[^)(]*\) ensures that the letter is within the ()
[^)(]* matches anything other than ( or )
\) ensures that the run of characters other than ( or ) is followed by )
This part is crucial, as it prevents a match inside def, since def is not followed by ) before the next (
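If the lookarounds feel hard to read, an alternative sketch is to match each parenthesised group as a whole and rebuild it in a replacement function. This swaps in a different technique from the answer above, and the name pipe_inside_parens is made up; it assumes the parentheses are not nested.

```python
import re

def pipe_inside_parens(text):
    # [^()]* keeps each match inside one non-nested pair of parentheses;
    # the lambda joins the captured characters with '|'
    return re.sub(r"\(([^()]*)\)",
                  lambda m: "(" + "|".join(m.group(1)) + ")",
                  text)

print(pipe_inside_parens("(abc)def(abc)"))  # (a|b|c)def(a|b|c)
```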
I don't have enough reputation to comment, but the regex you are looking for will look like this:
"\(.*?\)"
For each string you find, insert a | between each pair of characters.
Let me explain each part of the regex:
\( - represents the literal ( character.
. - A dot in regex represents any possible character.
* - In regex, this sign represents zero to infinite appearances of the previous character.
? - After *, makes the match stop at the first closing parenthesis instead of the last one.
\) - represents the literal ) character.
This way, you are looking for any appearance of "()" with characters between them.
Hope I helped :)
([^(])(?=[^(]*\))(?!\))
Try this. Replace with \1|. See demo.
https://regex101.com/r/sH8aR8/13
import re
p = re.compile(r'([^(])(?=[^(]*\))(?!\))')
test_str = "(abc)def(abc)"
subst = r"\1|"
result = re.sub(p, subst, test_str)
If you have only single characters in your round brackets, then what you could do would be to simply replace the round brackets with square ones. So the initial regex will look like this: (abc)def(abc) and the final regex will look like so: [abc]def[abc]. From a functional perspective, (a|b|c) has the same meaning as [abc].
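A quick way to convince yourself of that equivalence is to run both forms against a few sample strings. The sample list below is arbitrary; the only real difference is that the alternation form also creates capture groups.

```python
import re

alternation = re.compile(r"(a|b|c)def(a|b|c)")
char_class = re.compile(r"[abc]def[abc]")

samples = ["adefc", "bdefb", "ddefa", "abdefc", "def"]
# Both patterns should accept and reject exactly the same strings
agree = all(
    bool(alternation.fullmatch(s)) == bool(char_class.fullmatch(s))
    for s in samples
)
print(agree)  # True
```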
A simple Python version to achieve the same thing. Regex is a bit hard to read and often hard to debug or change.
word = '(abc)def(abc)'
split_w = word.replace('(', ' ').replace(')', ' ').split()
split_w[0] = '|'.join( list(split_w[0]) )
split_w[2] = '|'.join( list(split_w[2]) )
print("(%s)%s(%s)" % tuple(split_w))
We split the given string into three parts, pipe-separate the first and the last part and join them back.
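The version above is tied to exactly three parts. A regex-free sketch that copes with any number of groups is to walk the string and remember whether you are inside parentheses. The function name pipe_inside_parens_plain is invented, and nested parentheses are not handled.

```python
def pipe_inside_parens_plain(word):
    out = []
    inside = False           # currently between '(' and ')'
    prev_was_member = False  # previous char was a character inside the group
    for ch in word:
        if ch == "(":
            inside, prev_was_member = True, False
            out.append(ch)
        elif ch == ")":
            inside = False
            out.append(ch)
        else:
            if inside and prev_was_member:
                out.append("|")   # separator only between two group members
            out.append(ch)
            prev_was_member = inside
    return "".join(out)

print(pipe_inside_parens_plain("(abc)def(abc)"))  # (a|b|c)def(a|b|c)
```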

Split leading whitespace from rest of string

I'm not sure how to exactly convey what I'm trying to do, but I'm trying to create a function to split off a part of my string (the leading whitespace) so that I can edit it with different parts of my script, then add it again to my string after it has been altered.
So let's say I have the string:
"    That's four spaces"
I want to split it so I end up with:
"    " and "That's four spaces"
You can use re.match:
>>> import re
>>> re.match(r'(\s*)(.*)', "    That's four spaces").groups()
('    ', "That's four spaces")
>>>
(\s*) captures zero or more whitespace characters at the start of the string and (.*) gets everything else.
Remember though that strings are immutable in Python. Technically, you cannot edit their contents; you can only create new string objects.
For a non-Regex solution, you could try something like this:
>>> mystr = "    That's four spaces"
>>> n = next(i for i, c in enumerate(mystr) if c != ' ') # Count spaces at start
>>> (' ' * n, mystr[n:])
('    ', "That's four spaces")
>>>
The main tools here are next, enumerate, and a generator expression. This solution is probably faster than the Regex one, but I personally think that the first is more elegant.
Why don't you try matching instead of splitting?
>>> import re
>>> s = "    That's four spaces"
>>> re.findall(r'^\s+|.+', s)
['    ', "That's four spaces"]
Explanation:
^\s+ Matches one or more spaces at the start of a line.
| OR
.+ Matches all the remaining characters.
One solution is to lstrip the string, then figure out how many characters you've removed. You can then 'modify' the string as desired and finish by adding the whitespace back to your string. I don't think this would work properly with tab characters, but for spaces only it seems to get the job done:
my_string = "    That's four spaces"
no_left_whitespace = my_string.lstrip()
modified_string = no_left_whitespace + '!'
index = my_string.index(no_left_whitespace)
final_string = (' ' * index) + modified_string
print(final_string) #     That's four spaces!
And a simple test to ensure that we've done it right, which passes:
assert final_string == my_string + '!'
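The lstrip idea can probably be made to keep tabs as well by slicing the original prefix off the string instead of rebuilding it from spaces. A minimal sketch, with split_leading_whitespace as a made-up name:

```python
def split_leading_whitespace(text):
    stripped = text.lstrip()
    # whatever lstrip removed is exactly the prefix of this length
    lead = text[:len(text) - len(stripped)]
    return lead, stripped

lead, rest = split_leading_whitespace("    That's four spaces")
print(repr(lead), repr(rest))  # '    ' "That's four spaces"

# The prefix is kept verbatim, so tabs survive a round trip
lead, rest = split_leading_whitespace("\t  mixed indent")
print(lead + rest + "!" == "\t  mixed indent!")  # True
```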
One thing you can do is make a list out of the string, that is:
x = "    That's four spaces"
y = list(x)
z = "".join(y[0:4])  # if the count varies, loop here to detect the spaces at the start
k = "".join(y[4:])
s = []
s.append(z)
s.append(k)
print(s)
This is a non-regex solution which does not require any imports.

Python: Ignore a # / and random numbers in a string

I use part of my code to read a website, scrape some information, pass it to Google and print some directions.
I'm having an issue with some of the information: the site I use sometimes adds a # followed by 3 random digits, then a / and another 3 digits, e.g. #037/100
How can I use Python to ignore this "#037/100" string?
I currently use
for i, part in enumerate(list(addr_p)):
    if '#' in part:
        del addr_p[i]
        break
to remove the # if found but I'm not sure how to do it for the random numbers
Any ideas ?
If you find yourself wanting to remove "a # followed by three digits, a forward slash and three more digits" from a string s, you could do
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\d{3}\/\d{3}', '', s)
print(t)
Result (two spaces remain where the token was removed):
this is a string  with other stuff
Explanation:
# - literal character '#'
\d{3} - exactly three digits
\/ - forward slash (escaped here, although Python's re module does not require it)
\d{3} - exactly three digits
And the whole thing that matches the above (if it's present) is replaced with '' - i.e. "removed".
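Removing only the token leaves the spaces on both sides of it behind. If that matters, one option is to let the pattern swallow the whitespace in front of the token too. This is a sketch; TOKEN and drop_token are names chosen here, and the three-digit format is taken from the question.

```python
import re

# \s* also consumes any whitespace directly before the token
TOKEN = re.compile(r"\s*#\d{3}/\d{3}")

def drop_token(text):
    # strip() tidies up the case where the token was at the very start
    return TOKEN.sub("", text).strip()

print(drop_token("this is a string #123/234 with other stuff"))
# this is a string with other stuff
```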
import re
re.sub(r'#[0-9]+/[0-9]+$', '', addr_p[i])
I'm no wizard with regular expressions, but I'd imagine you could do something like this.
You could even handle '#' in the regexp as well.
If the format is always the same, then you could check if the line starts with a #, then set the string to itself without the first 8 characters.
if part[0:1] == '#':
    part = part[8:]
If the first character is a #, it sets the string to itself from index 8 (the ninth character) to the end.
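Applied to a whole list of address parts, a sketch that builds a new list instead of deleting items while looping might look like this. It assumes addr_p is a list of strings, which the question suggests but does not show; clean_parts and TOKEN_RE are made-up names.

```python
import re

TOKEN_RE = re.compile(r"#\d+/\d+")

def clean_parts(parts):
    cleaned = []
    for part in parts:
        # remove the token wherever it sits in the part, then trim spaces
        part = TOKEN_RE.sub("", part).strip()
        if part:                 # drop parts that consisted only of the token
            cleaned.append(part)
    return cleaned

print(clean_parts(["#037/100", "12 Main St", "Springfield"]))
# ['12 Main St', 'Springfield']
```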
I'd double your problems and match against a regular expression for this.
import re
regex = re.compile(r'([\w\s]+)#\d+\/\d+([\w\s]+)')
m = regex.match('This is a string with a #123/987 in it')
if m:
    s = m.group(1) + m.group(2)
    print(s)
A more concise way:
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\S+', '', s)
print(t)
