I have this string and a re.findall call:
import re
txt = """
dx d_2,222.22 ,,
dy h..{3,333.33} ,,
dz b#(1,111.11) ,, dx-ay relative 4,444.44 ,,
"""
for axis, value in re.findall(r'([-\w]+){1}\W+([^,{2}]+)\s+,,\W+', txt):
    print("a:", axis)
    print("v:", value)
In the second (value) group I am trying to match anything except double commas, but the class seems to exclude a single "," as well. I can get the result in this example with a simple (.*?), but for certain reasons the group has to be everything except ",,". Thank you.
EDIT: To see what I want to accomplish, just use r'([-\w]+){1}\W+(.*?)\s+,,\W+' instead. It produces this output:
a: dx
v: d_2,222.22
a: dy
v: h..{3,333.33}
a: dz
v: b#(1,111.11)
a: dx-ay
v: relative 4,444.44
EDIT #2: Please note that an answer which does not exclude the double comma is not what is needed. Is there a solution? There should be. So the pattern is:
Any whitespace, then a word possibly containing "-", then " ", then everything up to ",," but not ",," itself.
[^,{2}] is a character class that matches any character except: ',', '{', '2', '}'
With a "character class", also called "character set", you can tell the regex engine to match only one out of several characters.
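A quick way to check this, as a minimal sketch: inside the brackets {2} is not a quantifier, so each of the four characters simply ends a run. The sample string below is made up for illustration only.

```python
import re

# Hypothetical sample string, chosen only to contain every excluded character.
sample = "a,b{c2d}e"

# [^,{2}]+ matches runs of characters that are none of ',', '{', '2', '}'.
pieces = re.findall(r'[^,{2}]+', sample)
print(pieces)  # ['a', 'b', 'c', 'd', 'e']
```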
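To repeat the class twice, the quantifier has to go outside the brackets: ([^,]{2})+. Note that this still excludes every comma, not only a doubled one.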
( group and capture to \1
[^,]{2} any character except: ',' (2 times)
)+ end of \1
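One way to express "anything except a double comma" literally is a negative lookahead checked before each character. The following is only a sketch of that idea, not the pattern proposed in this answer; the pattern itself is an assumption, chosen to reproduce the output requested in the question.

```python
import re

txt = """
dx d_2,222.22 ,,
dy h..{3,333.33} ,,
dz b#(1,111.11) ,, dx-ay relative 4,444.44 ,,
"""

# (?:(?!,,).)+? consumes one character at a time, but only at positions
# where a double comma does not start; single commas are still allowed.
pattern = re.compile(r'([-\w]+)\s+((?:(?!,,).)+?)\s+,,')

pairs = pattern.findall(txt)
for axis, value in pairs:
    print("a:", axis)
    print("v:", value)
```

The lazy quantifier keeps the trailing whitespace before ",," out of the captured value.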
Alternatively, match lazily up to the double comma and read the matched groups at index 1 and 2:
([-\w]+)\s+(.*?)\s+,,
Here is online demo
sample code:
import re
p = re.compile(r'([-\w]+)\s+(.*?)\s+,,')
test_str = u"..."
re.findall(p, test_str)
Note: use \s* instead of \s+ if spaces are optional.
r'(?<=,,)\s+([-\w]+)\s(.*?)(?:,,)' is the expression that is needed here. Much simpler than I thought.
(?<=,,) is a positive lookbehind assertion: it only matches at a position that comes right after double commas, since the lookbehind backs up 2 characters and checks whether the contained pattern matches there. A record with no ,, in front of it, such as the first one in the sample, is therefore not matched.
(?:,,) at the end is a non-capturing version of regular parentheses, so everything in between is matched up to the next double comma.
\s or \s+ is there only because of this specific type of string.
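As a rough check of how this pattern behaves on the sample string from the question, here is a minimal sketch. It shows that the lookbehind skips the first record (nothing precedes it) and that the captured values keep their trailing space.

```python
import re

txt = """
dx d_2,222.22 ,,
dy h..{3,333.33} ,,
dz b#(1,111.11) ,, dx-ay relative 4,444.44 ,,
"""

# The lookbehind requires ",," immediately before the whitespace that
# precedes each axis name, so only records after the first one qualify.
pattern = re.compile(r'(?<=,,)\s+([-\w]+)\s(.*?)(?:,,)')

found = pattern.findall(txt)
for axis, value in found:
    print("a:", axis)
    print("v:", repr(value))
```

Calling value.strip() on each result would remove the trailing space if it is not wanted.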
I am trying to replace a certain part of a match that a regex found.
The relevant strings have the following format:
"<Random text>[Text1;Text2;....;TextN]<Random text>"
So basically there can be N texts separated by a ";" inside the brackets.
My goal is to change the ";" into a "," (but only for the strings which are in this format) so that I can keep the ";" as a separator for a CSV file. So the result should be:
"<Random text>[Text1,Text2,...,TextN]<Random text>"
I can match the relevant strings with something like
re.compile(r'\[".*?((;).*?){1,4}"\]')
but if I try to use the sub method it replaces the whole string.
I have searched Stack Overflow and I am fairly sure that "capture groups" might be the solution, but I am not really getting there.
Can anyone help me?
I ONLY want to change the ";" in the ["Text1;...;TextN"]-parts of my text file.
Try this regex:
;(?=(?:(?!\[).)*])
Replace each match with a ,
Click for Demo
Explanation:
; - matches a ;
(?=(?:(?!\[).)*]) - makes sure that the above ; is followed by a closing ] somewhere later in the string but before any opening bracket [
(?=....) - positive lookahead
(?:(?!\[).)* - 0+ occurrences of any character that is not a [
] - matches a ]
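In Python this substitution could look like the sketch below. The input line is a made-up example; it has one ";" before the brackets and one after them so that the effect of the lookahead is visible.

```python
import re

# Hypothetical input: semicolons outside and inside a bracketed part.
line = 'a;b["Text1;Text2;Text3"]c;d'

# A ";" is replaced only if a "]" follows with no "[" in between.
fixed = re.sub(r';(?=(?:(?!\[).)*])', ',', line)
print(fixed)  # a;b["Text1,Text2,Text3"]c;d
```

Because . does not match a newline by default, the lookahead only inspects the rest of the current line.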
If you want to match a ; that comes before a closing ] with no [ in between, you could use:
;(?=[^[]*])
; Match literally
(?= Positive lookahead, assert what is on the right is
[^[]* Negated character class, match 0+ times any char except [
] Match literally
) Close lookahead
Regex demo
Note that this will also match if there is no leading [
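A small sketch of this pattern with Python's re module, including the caveat about a missing leading [. Both input strings are invented for illustration, and the brackets inside the pattern are escaped, which does not change what it matches.

```python
import re

# Same idea as ;(?=[^[]*]) with the brackets escaped for readability.
pattern = re.compile(r';(?=[^\[]*\])')

# Hypothetical inputs: one with a bracketed part, one with a stray "]".
inside = pattern.sub(',', 'x["a;b"]y;z')
stray = pattern.sub(',', 'no;opening]bracket')

print(inside)  # x["a,b"]y;z
print(stray)   # no,opening]bracket
```

The second result shows the ";" being replaced even though no [ precedes it.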
If you also want to make sure that there is a leading [, you could use the PyPI regex module, with \G and \K, to match a single ;
(?:\[(?=[^[\]]*])|\G(?!^))[^;[\]]*\K;
Regex demo | Python demo
import regex
pattern = r"(?:\[(?=[^[\]]*])|\G(?!^))[^;[\]]*\K;"
test_str = ("[\"Text1;Text2;....;TextN\"];asjkdjksd;ajksdjksad[\"Text1;Text2;....;TextN\"]\n\n"
".[\"Text1;Text2\"]...long text...[\"Text1;Text2;Text3\"]....long text...[\"Text1;...;TextN\"]...long text...\n\n"
"I ONLY want to change the \";\" in the [\"Text1;...;TextN\"]")
result = regex.sub(pattern, ",", test_str)
print (result)
Output
["Text1,Text2,....,TextN"];asjkdjksd;ajksdjksad["Text1,Text2,....,TextN"]
.["Text1,Text2"]...long text...["Text1,Text2,Text3"]....long text...["Text1,...,TextN"]...long text...
I ONLY want to change the ";" in the ["Text1,...,TextN"]
You can try this code sample:
x = 'anbhb["Text1;Text2;...;TextN"]nbgbyhuyg["Text1;Text2;...;TextN"][]nhj,kji,'
chars = list(x)
inside = False  # True while scanning between [" and "]
for i, ch in enumerate(chars):
    if x.startswith('["', i):
        inside = True
    elif x.startswith('"]', i):
        inside = False
    elif inside and ch == ';':
        chars[i] = ','  # only semicolons inside ["..."] are changed
x = ''.join(chars)
print(x)
import re
s = "[abc]abx[abc]b"
s = re.sub(r"\[([^\]]*)\]a", "ABC", s)
# s is now 'ABCbx[abc]b'
In the string s, I want to match 'abc' when it is enclosed in [] and followed by an 'a'. So in that string the first [abc] should be replaced and the second should not.
I wrote the pattern above, it matches:
match anything starting with a '[', followed by any number of characters which is not ']', then followed by the character 'a'.
However, in the replacement, I want the string to be like:
[ABC]abx[abc]b . // NOT ABCbx[abc]b
Namely, I don't want the whole matched pattern to be replaced, only the text inside the brackets []. How can I achieve that?
match.group(1) will return the content in []. But how to take advantage of this in re.sub?
Why not simply include [ and ] in the substitution?
s = re.sub(r"\[([^\]]*)\]a", "[ABC]a", s)
There is more than one way to do this; one of them is to exploit groups.
import re
s = "[abc]abx[abc]b"
out = re.sub(r'(\[)([^\]]*)(\]a)', r'\1ABC\3', s)
print(out)
Output:
[ABC]abx[abc]b
Note that there are 3 groups (enclosed in parentheses) in the first argument of re.sub. I refer to the 1st and 3rd (indexing starts at 1) so they remain unchanged, and put ABC in place of the 2nd group. The second argument of re.sub is a raw string, so I do not need to escape \.
This regex uses lookarounds for the prefix/suffix assertions, so that the match text itself is only "abc":
(?<=\[)[^]]*(?=\]a)
Example: https://regex101.com/r/NDlhZf/1
So that's:
(?<=\[) - positive look-behind, asserting that a literal [ is directly before the start of the match
[^]]* - any number of non-] characters (the actual match)
(?=\]a) - positive look-ahead, asserting that the text ]a directly follows the match text.
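Used with re.sub, the lookaround version could look like this minimal sketch. It reuses the string from the question; since the brackets and the trailing a are only asserted, not matched, they survive the replacement.

```python
import re

s = "[abc]abx[abc]b"

# Only the text between "[" and "]a" is part of the match, so only
# that text is replaced.
out = re.sub(r'(?<=\[)[^]]*(?=\]a)', 'ABC', s)
print(out)  # [ABC]abx[abc]b
```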
I want to split a string only if there's a space before and after that character. In my case the character is the dash i.e '-'
Example
Opzione - AAAA-11
should be split into
Opzione AAAA-11
and not in
Opzione AAAA 11
Language is python.
Thanks
You can use lookaround
(?<=\s)-(?=\s)
(?<=\s) -> Positive lookbehind; checks for a preceding whitespace character.
- -> Matches -.
(?=\s) -> Positive lookahead; checks for a following whitespace character.
As a side note, \s also matches \r, \t and \n. If you want to consider only the space character, you can use this:
(?<= )-(?= )
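With re.split this could be used as in the sketch below. Because the lookarounds do not consume the surrounding spaces, the pieces keep them, so the example strips each piece afterwards; that extra step is an assumption about the desired result.

```python
import re

text = "Opzione - AAAA-11"

# Split only on a dash that has whitespace on both sides.
parts = re.split(r'(?<=\s)-(?=\s)', text)
print(parts)  # ['Opzione ', ' AAAA-11']

# The spaces around the dash were only asserted, so remove them here.
cleaned = [p.strip() for p in parts]
print(cleaned)  # ['Opzione', 'AAAA-11']
```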
You can do it with a regex, but how about a non-regex way using split() and join()?
s = 'Opzione - AAAA-11'
df = ' '.join(s.split(' - '))
print(df)
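import re
s = "Opzione - AAAA-11"
s = re.sub(r'(\s([\S])\s[\S]?)', '', s)
The pattern (\s([\S])\s[\S]?) matches one non-whitespace character between two whitespace characters, optionally followed by one more non-whitespace character. Take g h h g as an example.
Both h are between two spaces, but \s([\S])\s alone matches only the first h, because that match consumes the space in front of the second h; with the optional [\S]? added, the second h becomes part of the same match. Note that on the sample string the match is " - A", so the first A is removed along with the dash.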
Somehow puzzled by the way regular expressions work in python, I am looking to replace all commas inside strings that are preceded by a letter and followed either by a letter or a whitespace. For example:
2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15
2015,2135,602832/09,DOYLE V ICON, LLC,15,15
The first line has effectively 6 columns, while the second line has 7 columns. Thus I am trying to replace the comma between (N, L) in the second line by a whitespace (N L) as so:
2015,2135,602832/09,DOYLE V ICON LLC,15,15
This is what I have tried so far, without success however:
new_text = re.sub(r'([\w],[\s\w|\w])', "", text)
Any ideas where I am wrong?
Help would be much appreciated!
The pattern you use, ([\w],[\s\w|\w]), is consuming a word char (= an alphanumeric or an underscore, [\w]) before a ,, then matches the comma, and then matches (and again, consumes) 1 character - a whitespace, a word character, or a literal | (as inside the character class, the pipe character is considered a literal pipe symbol, not alternation operator).
So, the main problem is that \w matches both letters and digits.
You can actually leverage lookarounds:
(?<=[a-zA-Z]),(?=[a-zA-Z\s])
See the regex demo
The (?<=[a-zA-Z]) is a positive lookbehind that requires a letter to be right before the , and (?=[a-zA-Z\s]) is a positive lookahead that requires a letter or whitespace to be present right after the comma.
Here is a Python demo:
import re
p = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')
test_str = "2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15\n2015,2135,602832/09,DOYLE V ICON, LLC,15,15"
result = p.sub("", test_str)
print(result)
If you still want to use \w, you can exclude digits and underscore from it using an opposite class \W inside a negated character class:
(?<=[^\W\d_]),(?=[^\W\d_]|\s)
See another regex demo
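The \w-based variant can be tried in the same way. This is a minimal sketch using the two sample lines from the question; [^\W\d_] is "a word character that is neither a digit nor an underscore", which also covers non-ASCII letters.

```python
import re

text = ("2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15\n"
        "2015,2135,602832/09,DOYLE V ICON, LLC,15,15")

# Remove a comma only when a letter precedes it and a letter or
# whitespace follows it.
pattern = re.compile(r'(?<=[^\W\d_]),(?=[^\W\d_]|\s)')
result = pattern.sub('', text)
print(result)
```

Only the comma in "ICON, LLC" is removed; commas next to digits are left alone.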
\w matches a-z, A-Z, 0-9 and the underscore, so your regex will match every comma in these lines. You could try the following regex, and replace with \1\2.
([a-zA-Z]),(\s|[a-zA-Z])
Here is the DEMO.
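A short sketch of the capture-group approach in Python, using one of the sample lines from the question. The two groups put back the characters on either side of the comma, so only the comma itself is dropped.

```python
import re

text = "2015,2135,602832/09,DOYLE V ICON, LLC,15,15"

# \1 is the letter before the comma, \2 the whitespace or letter after it.
new_text = re.sub(r'([a-zA-Z]),(\s|[a-zA-Z])', r'\1\2', text)
print(new_text)  # 2015,2135,602832/09,DOYLE V ICON LLC,15,15
```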
I am trying to delete the single quotes surrounding regular text. For example, given the list:
alist = ["'ABC'", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I would like to turn "'ABC'" into "ABC", but keep other quotes, that is:
alist = ["ABC", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I tried to use lookarounds as below:
fixRepeatedQuotes = lambda text: re.sub(r'(?<!\\\'?)\'(?!\\)', r'', text)
print [fixRepeatedQuotes(str) for str in alist]
but received error message:
sre_constants.error: look-behind requires fixed-width pattern.
Any other workaround? Thanks a lot in advance!
This should work:
result = re.sub("""(?s)(?:')([^'"]+)(?:')""", r"\1", subject)
explanation
"""
(?: # Match the regular expression below
' # Match the character “'” literally (but the ? makes it a non-capturing group)
)
( # Match the regular expression below and capture its match into backreference number 1
[^'"] # Match a single character NOT present in the list “'"” from this character class (aka any character matches except a single and double quote)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
(?: # Match the regular expression below
' # Match the character “'” literally (but the ? makes it a non-capturing group)
)
"""
re.sub accepts a function as the replacement. The function receives the match object, and group(1) is the text between the quotes. Therefore,
re.sub(r"'([A-Za-z]+)'", lambda match: match.group(1), "'ABC'")
yields
"ABC"