Remove spaces from string after and before letter - python

I have quite a few strings that look like this: "a name / another name / something else".
I want to get to this: "a name/another name/something else".
Basically removing the spaces before and after the forward slashes only (not between the words themselves).
I know nothing about programming but I looked and found that this can be done with Python and Regex. I was a bit overwhelmed though with the amount of information I found.

You can use the pattern:
(?:(?<=\/) | (?=\/))
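(?: Start of a non-capturing group.
(?<=\/) Positive lookbehind for /; the space that follows it in the pattern is the character that gets matched.
| OR
(?=\/) Positive lookahead for /; the space that precedes it in the pattern is the character that gets matched.
) Close the non-capturing group.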
You can try it live here.
Python snippet:
import re
s = 'a name / another name / something else'  # avoid shadowing the built-in str
print(re.sub(r'(?:(?<=\/) | (?=\/))', '', s))
Prints:
a name/another name/something else

There's no need for regex here, since you're simply replacing a string of literals.
s = "a name / another name / something else"
print(s.replace(" / ", "/"))

Here is an answer without using regex that I feel is easier to understand
string = "a name / another name / something else"
edited = "/".join([a.strip() for a in string.split("/")])
print(edited)
output:
a name/another name/something else
.join() joins elements of a sequence by a given separator, docs
.strip() removes beginning and trailing whitespace, docs
.split() splits the string into tokens by character, docs

This pattern matches any amount of whitespace surrounding a /, so it can be used to remove that whitespace. I think the regex is relatively easy to understand
\s*([\/])\s*
Has a capturing group that matches the forward slash (that's what the middle part is). The \s* parts match whitespace (zero or more times, so any amount).
You can then replace these matched strings with just a / to get rid of all the whitespace.
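A minimal sketch of applying this pattern with re.sub. The function name tighten_slashes is made up for illustration, and the capturing group is left out because the replacement here is a literal slash:

```python
import re

def tighten_slashes(text):
    # \s* matches zero or more whitespace characters on each side of the slash
    return re.sub(r"\s*/\s*", "/", text)

print(tighten_slashes("a name / another name /   something else"))
# a name/another name/something else
```

Unlike a plain replace of " / ", this also handles tabs, repeated spaces, or a missing space on one side.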

str1 being your string:
re.sub(" / ", "/" ,str1)

Use the following code to remove all spaces before and after the / character:
import re
s = 'a name / another name / something else'
s = re.sub(r'(?:(?<=\/)\s*|\s*(?=\/))', '', s)
Check this document for more information.

Related

Using Python Regex to find a phrase between 2 tags

I've got a string that I want to use regex to find the characters encapsulated between two known patterns, "Cp_6%3A" then some characters then "&" and potentially more characters, or no & and just the end of string.
My code looks like this:
def extract_id_from_ref(ref):
    id = re.search("Cp\_6\%3A(.*?)(\& | $)", ref)
    print(id)
But this isn't producing anything, Any ideas?
Thanks in advance
Note that (\& | $) matches either the & char and a space after it, or a space and end of string (the spaces are meaningful here!).
Use a negated character class [^&]* (zero or more chars other than &) to simplify the regex (no need for an alternation group or lazy dot matching pattern) and then access .group(1):
def extract_id_from_ref(ref):
    m = re.search(r"Cp_6%3A([^&]*)", ref)
    if m:
        print(m.group(1))
Note that neither _ nor % is a special regex metacharacter, so neither has to be escaped.
See the regex demo.
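To see how the negated character class behaves, here is a small sketch that returns the value instead of printing it. The sample inputs are hypothetical and not taken from the question:

```python
import re

def extract_id_from_ref(ref):
    # [^&]* stops at the first '&' or at the end of the string
    m = re.search(r"Cp_6%3A([^&]*)", ref)
    return m.group(1) if m else None

# Hypothetical sample inputs
print(extract_id_from_ref("x=1&Cp_6%3Aabc123&y=2"))  # abc123
print(extract_id_from_ref("Cp_6%3Aabc123"))          # abc123
print(extract_id_from_ref("no marker here"))         # None
```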
The problem is that spaces in a regex pattern are also taken into account. Furthermore, in order to put a backslash in the string, you either have to write \\ (two backslashes) or use a raw string.
So you should write:
r"Cp_6\%3A(.*?)(?:\&|$)"
If you then match with:
def extract_id_from_ref(ref):
    id = re.search(r"Cp_6\%3A(.*?)(?:\&|$)", ref)
    print(id)
It should work.
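The effect of the stray spaces can be checked directly. This is a small sketch with a hypothetical input that has no trailing &:

```python
import re

ref = "Cp_6%3Aabc123"  # hypothetical input, ends right after the id

# The spaces inside the group are literal: this needs "& " or " " + end of string
with_spaces = re.search(r"Cp_6%3A(.*?)(& | $)", ref)
# Without the spaces the alternation means "& or end of string"
without_spaces = re.search(r"Cp_6%3A(.*?)(?:&|$)", ref)

print(with_spaces)              # None
print(without_spaces.group(1))  # abc123
```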

how to use python regex find matched string?

For the string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find the "#..'...'" parts, like "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use the regex ((#.+)\~(\'.*?\')), it finds "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'". How should I modify the regex to find these strings?
Use non-capturing inner groups, a non-greedy quantifier, and negated character classes that stop at the terminating character, e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
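A runnable version of that one-liner, assuming x holds the test string from the question:

```python
import re

x = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"

# Only lowercase letters before '~', only lowercase letters and '-' inside the quotes
matches = re.findall(r"#[a-z]+~'[-a-z]*'", x)
print(matches)
# ["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
```

If real attribute names or values can contain digits, underscores or uppercase letters, the two character classes would need to be widened accordingly.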
For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# match everything from '#' up to, but not including, the next ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]

How to remove substrings marked with special characters from a string?

I have a string in Python:
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print(Tt)
This is a <"string">string, It should be <"changed">changed to <"a">a nummber.
You can see that some words are repeated inside the <\" \"> parts.
My question is, how to delete those repeated parts (delimited with the named characters)?
The result should be like:
'This is a string, It should be changed to a nummber.'
Use regular expressions:
import re
Tt = re.sub('<\".*?\">', '', Tt)
Note the ? after *. It makes the expression non-greedy,
so it matches as few characters as possible between <\" and \">.
James's solution works only when each delimiter
is a single character (< and >). In that case it is possible to use negations like [^>]. If you want to remove a substring delimited by character sequences (e.g. by begin and end), you should use non-greedy regular expressions (i.e. .*?).
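A short sketch of the multi-character case. The markers [[ and ]] are hypothetical stand-ins for delimiters such as begin and end, and re.escape is used so that they are treated literally:

```python
import re

start, stop = "[[", "]]"   # hypothetical multi-character delimiters
text = "a[[x]]b[[y]]c"

# .*? stops at the first closing marker after each opening marker
lazy = re.compile(re.escape(start) + ".*?" + re.escape(stop))
# A greedy .* runs on to the last closing marker and removes too much
greedy = re.compile(re.escape(start) + ".*" + re.escape(stop))

print(lazy.sub("", text))    # abc
print(greedy.sub("", text))  # ac
```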
I'd use a quick regular expression:
import re
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print(re.sub("<[^<]+>", "", Tt))
#Out: This is a string, It should be changed to a nummber.
Ah - similar to Igor's post, he beat me by a bit. Rather than making the expression non-greedy, I don't match an expression if it contains another start tag "<" in it, so it will only match a start tag that's followed by an end tag ">".

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print(re.sub(x, '_', s))
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since the ^ can only be used inside brackets [], this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no "not" in REs except ^ inside [ ] which only matches one character (as you found). Here is my solution:
import re
def subit(m):
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print(re.sub(r'(.+?)(going|you|$)', subit, s))
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
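The same callback technique can be wrapped so that the words to keep are passed in. This is only a sketch: mask_except and its keep parameter are hypothetical names, and because .+? must consume at least one character, a kept word at the very start of the string would still be masked:

```python
import re

def mask_except(text, keep):
    # Build "(.+?)(word1|word2|$)" from the words to keep
    pattern = r"(.+?)(%s|$)" % "|".join(map(re.escape, keep))

    def subit(m):
        stuff, word = m.groups()
        # Underscore out the skipped text, keep the matched word (or '' at the end)
        return "_" * len(stuff) + word

    return re.sub(pattern, subit, text)

print(mask_except("I am going home now, thank you.", ["going", "you"]))
# _____going_________________you_
```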
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that regex engines (most of them are traditional NFA engines) scan the string character by character, so to exclude whole words you have to describe the allowed matches in terms of words. The negative lookahead (?!going|you) rejects a match that would start at either word, and the word boundaries \b make each match a single word or the run of non-word characters between two words. Because sub replaces every match with the replacement string, passing a plain '_' would turn a part like I am into only 3 underscores (one each for 'I', ' ' and 'am'). Passing a function as the second argument of sub instead lets you multiply '_' by the length of each matched string.

Best way to split a string for the last space

I'm wondering what the best way is to split a space-separated string at the last space which is not inside [, {, ( or ". For instance I could have:
a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'
For a it should parse into ['a', 'b', 'c', 'd', 'e', 'f'], ["something else here"]
and b should parse into ['another', 'parse', 'option'], ['{(["gets confusing"])}']
Right now I have this:
import sys
def getMin(aList):
    # smallest index in the list, ignoring -1 (find() returns -1 for "not found")
    min = sys.maxsize
    for item in aList:
        if item < min and item != -1:
            min = item
    return min
myList = []
myList.append(b.find('['))
myList.append(b.find('{'))
myList.append(b.find('('))
myList.append(b.find('"'))
myMin = getMin(myList)
print(b[:myMin], b[myMin:])
I'm sure there's better ways to do this and I'm open to all suggestions
Matching vs. Splitting
There is an easy solution. The key is to understand that matching and splitting are two sides of the same coin. When you say "match all", that means "split on what I don't want to match", and vice-versa. Instead of splitting, we're going to match, and you'll end up with the same result.
The Reduced, Simple Version
Let's start with the simplest version of the regex so you don't get scared by something long:
{[^{}]*}|\S+
This matches all the items of your second string—the same as if we were splitting (see demo)
The left side of the | alternation matches complete sets of {braces}.
The right side of the | matches any run of characters that are not whitespace characters.
It's that simple!
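As a quick check, the reduced pattern can be tried with re.findall on the second string from the question (a minimal sketch):

```python
import re

b = 'another parse option {(["gets confusing"])}'

# Left alternative: a complete {...} block; right alternative: a run of non-whitespace
tokens = re.findall(r"{[^{}]*}|\S+", b)
print(tokens)
# ['another', 'parse', 'option', '{(["gets confusing"])}']
```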
The Full Regex
We also need to match "full quotes", (full parentheses) and [full brackets]. No problem: we just add them to the alternation. Just for clarity, I'm putting them together in a non-capturing group (?: ... ) so that the \S+ stands out on its own, but the group is not required.
(?:{[^{}]*}|"[^"]*"|\([^()]*\)|\[[^][]*\])|\S+
See demo.
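One way to get from the matches to the two lists the question asks for is to tokenize and then split off the final token. This is a sketch only; the names TOKEN and split_last are made up for illustration:

```python
import re

TOKEN = re.compile(r'(?:{[^{}]*}|"[^"]*"|\([^()]*\)|\[[^][]*\])|\S+')

def split_last(text):
    # Tokenize, then separate the final token from the rest
    tokens = TOKEN.findall(text)
    return tokens[:-1], tokens[-1:]

a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'
print(split_last(a))  # (['a', 'b', 'c', 'd', 'e', 'f'], ['"something else here"'])
print(split_last(b))  # (['another', 'parse', 'option'], ['{(["gets confusing"])}'])
```

Note that this always treats the last token as the tail, even when it is a plain word.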
Notes and Potential Improvements
We could replace the quoted string regex by one that accepts escaped quotes
We could replace the brace, brackets and parentheses expressions by recursive expressions to allow nested constructions, but you'd have to use Matthew Barnett's (awesome) regex module instead of re
The technique is related to a simple and beautiful trick to Match (or replace) a pattern except when...
Let me know if you have questions!
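For the first improvement, a commonly used quoted-string pattern that accepts backslash-escaped quotes could look like the sketch below. It is an assumption about what the answer has in mind, and the sample text is hypothetical:

```python
import re

# Inside the quotes: either a character that is not a quote or backslash,
# or a backslash followed by any character (which covers \")
QUOTED = r'"(?:[^"\\]|\\.)*"'
TOKEN = re.compile(QUOTED + r"|\S+")

text = r'say "a \"quoted\" word" now'
print(TOKEN.findall(text))
# ['say', '"a \\"quoted\\" word"', 'now']
```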
You can use regular expressions:
import re
def parse(text):
    # [ is escaped inside the character class to avoid a "possible nested set" warning
    m = re.search(r'(.*) ([\[({"].*)', text)
    if not m:
        return None
    return m.group(1).split(), [m.group(2)]
The first part (.*) catches everything up to the section in quotes or parenthesis, and the second part catches anything starting at a character in ([{".
If you need something more robust, this has a more complicated regular expression, but it will make sure that the opening token is matched, and it makes the last expression optional.
def parse(text):
    m = re.search(r'(.*?)(?: ("[^"]*"|\([^)]*\)|\[[^]]*\]|\{[^}]*\}))?$', text)
    if not m:
        return None
    return m.group(1).split(), [m.group(2)]
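Because the last group is optional, m.group(2) is None when the string has no trailing quoted or bracketed section. The sketch below is a variation, not the answer's exact code: it returns an empty list in that case, which is an assumption about the desired behaviour:

```python
import re

def parse(text):
    m = re.search(r'(.*?)(?: ("[^"]*"|\([^)]*\)|\[[^]]*\]|\{[^}]*\}))?$', text)
    if not m:
        return None
    # group(2) is None when there is no trailing quoted or bracketed section
    tail = [m.group(2)] if m.group(2) is not None else []
    return m.group(1).split(), tail

print(parse('a b c d e f "something else here"'))
# (['a', 'b', 'c', 'd', 'e', 'f'], ['"something else here"'])
print(parse('plain words only'))
# (['plain', 'words', 'only'], [])
```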
Perhaps this link will help:
Split a string by spaces -- preserving quoted substrings -- in Python
It explains how to preserve quoted substrings when splitting a string by spaces.
