Python, RegEx, Replace a certain part of a match - python

I am trying to replace a certain part of a match that a regex found.
The relevant strings have the following format:
"<Random text>[Text1;Text2;....;TextN]<Random text>"
So basically there can be N Texts separated by a ";" inside the brackets.
My goal is to change the ";" into a "," (but only for the strings which are in this format) so that I can keep the ";" as a separator for a CSV file. So the result should be:
"<Random text>[Text1,Text2,...,TextN]<Random text>"
I can match the relevant strings with something like
re.compile(r'\[".*?((;).*?){1,4}"\]')
but if I try to use the sub method it replaces the whole string.
I have searched stackoverflow and I am pretty sure that "capture groups" might be the solution but I am not really getting there.
Can anyone help me?
I ONLY want to change the ";" in the ["Text1;...;TextN"]-parts of my text file.

Try this regex:
;(?=(?:(?!\[).)*])
Replace each match with a ,
Explanation:
; - matches a ;
(?=(?:(?!\[).)*]) - makes sure that the above ; is followed by a closing ] somewhere later in the string but before any opening bracket [
(?=....) - positive lookahead
(?:(?!\[).)* - 0+ occurrences of any character that is not a [
] - matches a ]
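As a minimal sketch of how this might be applied in Python with the standard re module (the function name and the sample string are only illustrative, they are not from the question's data):

```python
import re

# Pattern from the answer above: a ';' that is followed by a ']'
# with no '[' in between.
SEMICOLON_BEFORE_CLOSE = re.compile(r';(?=(?:(?!\[).)*])')

def semicolons_to_commas(text):
    """Replace ';' with ',' only where a closing ']' follows before any '['."""
    return SEMICOLON_BEFORE_CLOSE.sub(',', text)

sample = 'abc;def["Text1;Text2;Text3"]ghi;jkl'
print(semicolons_to_commas(sample))  # abc;def["Text1,Text2,Text3"]ghi;jkl
```

Since . does not match a newline by default, the lookahead only looks as far as the end of the current line.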

If you want to match a ; before a closing ] and not matching [ in between you could use:
;(?=[^[]*])
; Match literally
(?= Positive lookahead, assert what is on the right is
[^[]* Negated character class, match 0+ times any char except [
] Match literally
) Close lookahead
Note that this will also match if there is no leading [
If you also want to make sure that there is a leading [ you could make use of the PyPI regex module and use \G and \K to match a single ;
(?:\[(?=[^[\]]*])|\G(?!^))[^;[\]]*\K;
import regex
pattern = r"(?:\[(?=[^[\]]*])|\G(?!^))[^;[\]]*\K;"
test_str = ("[\"Text1;Text2;....;TextN\"];asjkdjksd;ajksdjksad[\"Text1;Text2;....;TextN\"]\n\n"
".[\"Text1;Text2\"]...long text...[\"Text1;Text2;Text3\"]....long text...[\"Text1;...;TextN\"]...long text...\n\n"
"I ONLY want to change the \";\" in the [\"Text1;...;TextN\"]")
result = regex.sub(pattern, ",", test_str)
print (result)
Output
["Text1,Text2,....,TextN"];asjkdjksd;ajksdjksad["Text1,Text2,....,TextN"]
.["Text1,Text2"]...long text...["Text1,Text2,Text3"]....long text...["Text1,...,TextN"]...long text...
I ONLY want to change the ";" in the ["Text1,...,TextN"]
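If installing the regex module is not an option, a similar guarantee (a leading [ is required) can probably be had with the standard re module alone, by matching each bracketed part as a whole and rewriting it in a replacement function. This is a different technique from \G and \K, and the helper name below is made up for the sketch. It assumes the brackets are not nested.

```python
import re

# Hypothetical helper pattern: one [...] part that contains no nested brackets.
BRACKETED = re.compile(r'\[[^\[\]]*\]')

def fix_bracketed(text):
    """Turn ';' into ',' inside each [...] part; text outside is left alone."""
    return BRACKETED.sub(lambda m: m.group(0).replace(';', ','), text)

# The stray ']' after 'x;y' has no opening '[', so that ';' is not touched.
print(fix_bracketed('x;y]["Text1;Text2"];z'))  # x;y]["Text1,Text2"];z
```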

You can try this code sample:
import re
x = 'anbhb["Text1;Text2;...;TextN"]nbgbyhuyg["Text1;Text2;...;TextN"][]nhj,kji,'
for i in range(len(x)):
    if x[i] == '[' and x[i + 1] == '"':
        while x[i+2] != '"':
            list1 = list(x)
            if x[i] == ';':
                list1[i] = ','
                x = ''.join(list1)
            i = i + 1
print(x)

Related

Regex to ignore data between brackets

I replace the characters ", { and } with an empty string using the code below:
s = "\":{},"
print(s)
print(re.sub(r'\"|{|}' , "",s))
prints:
":{},
:,
which is expected.
I'm attempting to modify the regex to ignore everything between open and closed brackets. So for the string "\":{},[test,test2]" just :,[test,test2] should be returned.
How can I modify the regex so that it is not applied to the data contained between [ and ]?
I tried using:
s = "\":{},[test1, test2]"
print(s)
print(re.sub(r'[^a-zA-Z {}]+\"|{|}' , "",s))
(src: How to let regex ignore everything between brackets?)
None of the , values are replaced.
Assuming your brackets are balanced/unescaped, you may use this regex with a negative lookahead to assert that matched character is not inside [...]:
>>> import re
>>> s = "\":{},[test1,test2]"
>>> print (re.sub(r'[{}",](?![^[]*\])', '', s))
:[test1,test2]
RegEx Details:
[{}",]: Match one of the characters {, }, " or ,
(?![^[]*\]): Negative lookahead to assert that there is no ] ahead without a [ in between; in other words, the matched character is not inside [...]
If you want to remove the {, }, , and " not inside square brackets, you can use
re.sub(r'(\[[^][]*])|[{}",]', r'\1', s)
Note you can add more chars to the character set, [{}",]. If you need to add a hyphen, make sure it is the last char in the character set. Escape \, ] (if not the first, right after [) and ^ (if it comes first, right after [).
Details:
(\[[^][]*]) - Capturing group 1: a [...] substring
| - or
[{}",] - a {, }, , or " char.
See a Python demo using your sample input:
import re
s = "\":{},[test1, test2]"
print( re.sub(r'(\[[^][]*])|[{}",]', r'\1', s) )
## => :[test1, test2]
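The note about adding more chars to the character set can be sidestepped by letting re.escape do the escaping. The following is only a sketch: strip_outside_brackets and its parameters are hypothetical names, and it assumes a non-empty set of characters and balanced, non-nested brackets.

```python
import re

def strip_outside_brackets(text, chars):
    """Remove every character in `chars` unless it sits inside [...]."""
    # re.escape takes care of '-', ']', '^' and '\' inside the character set.
    char_class = '[' + re.escape(chars) + ']'
    pattern = re.compile(r'(\[[^][]*])|' + char_class)
    # Keep a bracketed part (group 1) as is; drop a single matched character.
    return pattern.sub(lambda m: m.group(1) or '', text)

print(strip_outside_brackets('":{},[test1, test2]', '{}",'))  # :[test1, test2]
print(strip_outside_brackets('a-b[c-d]', '-'))                # ab[c-d]
```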

Regular expression for printing integers within brackets

First time ever using regular expressions and can't get it working although there's quite a few examples in stackoverflow already.
How can I extract integers which are in a string inside bracket?
Example:
dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x
would return
[43, 32, -99]
'+' and '-' is okay, if it's in the beginning of the brackets, but not okay if it's in the middle or end. If the '+' sign is in the beginning, it should not be taken into account. (+54 --> 54)
Been trying :
re.findall('\[[-]?\d+\]',str)
but it's not working the way I want.
If you need to fail the match in [ +-34 ] (i.e. if you needn't extract a negative number if there is a + before it) you will need to use
\[\s*(?:\+|(-))?(\d+)\s*]
and when getting a match, concat the Group 1 and Group 2 values.
Details
\[ - a [ char
\s* - 0+ whitespaces
(?:\+|(-))? - an optional + char, or an optional - char that is captured into Group 1
(\d+) - Capturing group 2: 1+ digits
\s* - 0+ whitespaces
] - a ] char.
In Python,
import re
text = "dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x"
numbers_text = [f"{x}{y}" for x, y in re.findall(r'\[\s*(?:\+|(-))?(\d+)\s*]', text)]
numbers = list(map(int, numbers_text))
# => [43, 32, -99]
If you want to extract integers from a string the code that I use is this:
def stringToNumber(inputStr):
    myNumberList = []
    for s in inputStr.split():
        newString = ''.join(i for i in s if i.isdigit())
        if (len(newString) != 0):
            myNumberList.append(newString)
    return myNumberList
I hope it works for you.
If you've not done so I suggest you switch to the PyPI regex module. Using it here with regex.findall and the following regular expression allows you to extract just what you need.
r'\[ *\+?\K-?\d+(?= *\])'
The regex engine performs the following operations.
\[ : match '['
 * : match 0+ spaces (the token is a space followed by *)
\+? : optionally match '+'
\K : forget everything matched so far and reset the start of the match to the current position
-? : optionally match '-'
\d+ : match 1+ digits
(?= *\]) : use positive lookahead to assert the last digit matched is followed by 0+ spaces then ']'
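If the third-party module is not available, roughly the same steps can be approximated with the standard re module by putting the part after \K into a capturing group and including the trailing spaces and ] in the match. This is a sketch of that substitution, not the answer's own code; the names are illustrative.

```python
import re

# Stdlib variant: group 1 plays the role of everything after \K.
BRACKETED_INT = re.compile(r'\[ *\+?(-?\d+) *\]')

def bracketed_integers(text):
    """Return the integers that stand alone inside square brackets."""
    return [int(number) for number in BRACKETED_INT.findall(text)]

text = "dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x"
print(bracketed_integers(text))  # [43, 32, -99]
```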

How to check if a string is between 2 Strings, and return the following characters with regex

I would like to check if a String is in a text file, between two other Strings, and if true, return the very next String matching a regex...
And I have no clue how to achieve it!
Since you're maybe lost with my explanation, I'll explain it better with my problem:
I'm creating an app (in python) reading a pdf and converting it in .txt.
In this txt, I would like to find the pH and return it. I know that I will find it between section 10 and 11, like this :
10. blablablablabla pH 7,6 blablablabla 11.
So
How can I reduce my research between "10." and "11."?
for the pH part, I think this is something like :
if 'pH' in open(file).read():
If we find 'ph', how can I code that I would like the next String obeying this regex : re.search("[0-9]{1}[,.]?[0-9]?", file)
I would use the following:
regex = re.compile(r"\b10\.(?:(?!\b11\.|\bpH\b).)*\bpH\b\s*(\d+(?:[.,]\d+)?)(?=.*\b11\.)", re.DOTALL)
pH = regex.search(my_string).group(1)
Test it live on regex101.com.
What it does is match a pH value only if it's found between 10. and 11., and if there is more than one, it finds the first one.
Explanation:
\b10\. # Match 10. (but not 110.)
(?: # Start of a (repeating) group that matches...
(?! # (if we're not at the start of either...
\b11\. # the number 11.
| # or
\bpH\b # the string pH
) # )
. # any character (including newlines, therefore the DOTALL option).
)* # Repeat as necessary.
\bpH\b # Match the string pH
\s* # Match optional whitespace
( # Match and capture in group 1:
\d+ # At least one digit
(?:[.,]\d+)? # optionally followed by a decimal part
) # End of capturing group
(?= # Assert that the following can be matched afterwards:
.* # any number of characters
\b11\. # followed by 11.
) # End of lookahead assertion.
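A small usage sketch of the pattern above, assuming the pH should end up as a float and that a missing match should give None instead of raising (search returns None when nothing matches, so calling .group(1) on it directly would fail). The function name and the second sample string are made up for the illustration.

```python
import re

PH_BETWEEN_SECTIONS = re.compile(
    r"\b10\.(?:(?!\b11\.|\bpH\b).)*\bpH\b\s*(\d+(?:[.,]\d+)?)(?=.*\b11\.)",
    re.DOTALL,
)

def extract_ph(text):
    """Return the first pH value between '10.' and '11.' as a float, or None."""
    match = PH_BETWEEN_SECTIONS.search(text)
    if match is None:
        return None
    # Accept both "7,6" and "7.6" as decimal notation.
    return float(match.group(1).replace(",", "."))

print(extract_ph("10. blablablablabla pH 7,6 blablablabla 11."))  # 7.6
print(extract_ph("9. pH 3,2 10. no value here 11."))              # None
```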
This should work, assuming you can put whatever you want in the 234 spot. This returns everything after the pH symbol that matches "234".
import re
my_str = "10. blablablablabla pH 1234 11. 234"
match_list = re.findall(r'10\..*pH.*(234).*11\.', my_str)
print(match_list)
Abstractly, this looks for a string matching the following pattern: start_pattern wildcard pre_pattern wildcard captured_pattern wildcard end_pattern. All the wildcards are .*, which matches 0 or more occurrences of any character. The captured pattern is the part between the two parentheses, (my_pattern), which in this case is 234.
To illustrate my last point better, here is the above with variables:
import re
start_pattern = r"10\."
end_pattern = r"11\."
pre_pattern = "pH"
wildcard = ".*"
captured_pattern = "234"
my_str = "10. blablablablabla pH 1234 11. 234"
# Assemble the full pattern from its parts.
pattern = (start_pattern
           + wildcard
           + pre_pattern
           + wildcard
           + "(" + captured_pattern + ")"
           + wildcard
           + end_pattern)
match_list = re.findall(pattern, my_str)
print(match_list)
If I've understood correctly I am assuming that a line starting with 10. will always end with 11.. If so we only need to find the 10. and check what comes after that:
10\.\s.+(?<=pH )(\d[.,]?\d)(?=\s)
This matches the 10. then anything up to a digit which is preceded by "pH " (using a positive look-behind). It then restricts the capture to 2 digits, optionally split by a period or comma
UPDATE
Based on the clarifications in the comments, this now has the 11. end delimiter and captures the required digits after the first "pH" found
\b10\.\s.+(?<=pH )(\d[.,]?\d)\s.+?\b11\.

insert char with regular expression

I have a string '(abc)def(abc)' and I would like to turn it into '(a|b|c)def(a|b|c)'. I can do that by:
word = '(abc)def(abc)'
pattern = ''
index = 0
while index < len(word):
    if word[index] == '(':
        pattern += word[index]
        index += 1
        while word[index+1] != ')':
            pattern += word[index]+'|'
            index += 1
        pattern += word[index]
    else:
        pattern += word[index]
    index += 1
print(pattern)
But I want to use regular expression to make it shorter. Can you show me how to insert char '|' between only characters that are inside the parentheses by regular expression?
How about
>>> import re
>>> re.sub(r'(?<=[a-zA-Z])(?=[a-zA-Z-][^)(]*\))', '|', '(abc)def(abc)')
'(a|b|c)def(a|b|c)'
(?<=[a-zA-Z]) Positive lookbehind. Ensures that the position to insert is preceded by a letter.
(?=[a-zA-Z-][^)(]*\)) Positive lookahead. Ensures that the position is followed by a letter (or a hyphen)
[^)(]*\) ensures that the letter is within the ()
[^)(]* matches anything other than ( or )
\) ensures that anything other than ( or ) is followed by )
This part is crucial, as it does not match the part def, since def is followed by ( before any )
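An alternative to the lookarounds, sketched here under the assumption that the parentheses are not nested, is to match each parenthesised group as a whole and rebuild it in a replacement function. The function name is illustrative.

```python
import re

def pipe_inside_parens(word):
    """Insert '|' between the characters of every (...) group."""
    # Group 1 holds the content of one pair of parentheses;
    # '|'.join() on a string puts '|' between its characters.
    return re.sub(r'\(([^()]*)\)',
                  lambda m: '(' + '|'.join(m.group(1)) + ')',
                  word)

print(pipe_inside_parens('(abc)def(abc)'))  # (a|b|c)def(a|b|c)
```

Unlike the letter-based lookarounds, this version separates whatever characters appear inside the parentheses, digits included.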
I don't have enough reputation to comment, but the regex you are looking for will look like this:
\(.*?\)
For each string you find, insert a | between each pair of characters.
Let me explain each part of the regex:
\( - represents the literal ( character (escaped, because a bare ( opens a group).
. - A dot in regex represents any possible character.
*? - In regex, * represents zero to infinite appearances of the previous character; the trailing ? makes it stop at the first ) instead of the last one.
\) - represents the literal ) character.
This way, you are looking for any appearance of "()" with characters between them.
Hope I helped :)
([^(])(?=[^(]*\))(?!\))
Try this. Replace with \1|. See demo.
https://regex101.com/r/sH8aR8/13
import re
p = re.compile(r'([^(])(?=[^(]*\))(?!\))')
test_str = "(abc)def(abc)"
# Raw string, so that \1 is a backreference and not the character chr(1).
subst = r"\1|"
result = re.sub(p, subst, test_str)
If you have only single characters in your round brackets, then what you could do would be to simply replace the round brackets with square ones. So the initial regex will look like this: (abc)def(abc) and the final regex will look like so: [abc]def[abc]. From a functional perspective, (a|b|c) has the same meaning as [abc].
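A quick way to convince yourself of that equivalence is to run both forms against a few candidate strings. The candidates below are arbitrary examples, and the equivalence only holds while every alternative is a single character.

```python
import re

alternation = re.compile(r'(a|b|c)def(a|b|c)')
char_class = re.compile(r'[abc]def[abc]')

candidates = ['adefc', 'bdefb', 'ddefa', 'abdefc', 'cdef']
for candidate in candidates:
    by_alternation = alternation.fullmatch(candidate) is not None
    by_class = char_class.fullmatch(candidate) is not None
    # Both columns should always agree.
    print(candidate, by_alternation, by_class)
```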
A simple Python version to achieve the same thing. Regex is a bit hard to read and often hard to debug or change.
word = '(abc)def(abc)'
split_w = word.replace('(', ' ').replace(')', ' ').split()
split_w[0] = '|'.join( list(split_w[0]) )
split_w[2] = '|'.join( list(split_w[2]) )
print("(%s)%s(%s)" % tuple(split_w))
We split the given string into three parts, pipe-separate the first and the last part and join them back.

How to match exception with double character with Python regular expression?

Got this string and regex findall:
import re
txt = """
dx d_2,222.22 ,,
dy h..{3,333.33} ,,
dz b#(1,111.11) ,, dx-ay relative 4,444.44 ,,
"""
for n in re.findall( r'([-\w]+){1}\W+([^,{2}]+)\s+,,\W+', txt ) :
    axis, value = n
    print("a:", axis)
    print("v:", value)
In the second (value) group I am trying to match anything except double commas, but it seems to stop at a single ",". I can get it working in this example with a simple (.*?), but for certain reasons it has to be everything except ",,". Thank you.
EDIT: To see what I want to accomplish just use r'([-\w]+){1}\W+(.*?)\s+,,\W+' instead. It will give you such output:
a: dx
v: d_2,222.22
a: dy
v: h..{3,333.33}
a: dz
v: b#(1,111.11)
a: dx-ay
v: relative 4,444.44
EDIT #2: Please note that an answer which does not include the double comma exception is not what is needed. Is there a solution? There should be. So the pattern is:
Any whitespace - a word possibly containing "-" - then " " - and everything up to ",," except ",," itself.
[^,{2}] is a character class that matches any character except: ',', '{', '2', '}'
With a "character class", also called "character set", you can tell the regex engine to match only one out of several characters.
It should be ([^,]{2})+
( group and capture to \1
[^,]{2} any character except: ',' (2 times)
)+ end of \1
Get the matched group from index 1 and 2
([-\w]+)\s+(.*?)\s+,,
sample code:
import re
p = re.compile(r'([-\w]+)\s+(.*?)\s+,,')
test_str = u"..."
re.findall(p, test_str)
Note: use \s* instead of \s+ if spaces are optional.
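If the value group really has to refuse a double comma explicitly, as the question asks, one option is a negative lookahead in front of every consumed character. This is a sketch of that idea and not part of the answer above; it reuses the question's sample text, and the name PAIR is made up.

```python
import re

txt = """
dx d_2,222.22 ,,
dy h..{3,333.33} ,,
dz b#(1,111.11) ,, dx-ay relative 4,444.44 ,,
"""

# (?:(?!,,).)+? consumes one character at a time, but never one that
# starts a ",,"; single commas inside a value are still allowed.
PAIR = re.compile(r'([-\w]+)\s+((?:(?!,,).)+?)\s+,,')

for axis, value in PAIR.findall(txt):
    print("a:", axis)
    print("v:", value)
```

On this input it yields the same four axis/value pairs as the (.*?) version shown in the question.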
r'(?<=,,)\s+([-\w]+)\s(.*?)(?:,,)' is the expression that is needed here. Much simpler than I thought.
(?<=,,) is a positive lookbehind assertion: it matches only at a position in the string that comes right after double commas, since the lookbehind backs up 2 chars and checks whether the contained pattern matches there.
(?:,,) at the end is the non-capturing version of regular parentheses, so everything in between is what gets matched.
\s or \s+ is there only because of this specific type of string.
