Replacing specific characters after a string match - python

I'm looking to replace specific characters in numbers I'm extracting but I cannot figure out how to do so.
Here "," is the decimal separator and ' or . are thousands separators. I can match the numbers this way:
>>> myString = "I buy 456'123,45 then 45.654 then 123. I'm 30."
>>> re.findall(r"(?:\d+)(?:['|.]\d+)+(?:[,]\d+)?", myString)
["456'123,45", '45.654']
I want to remove all thousands separators from my string to get this:
>>> newString
"I buy 456123,45 then 45654 then 123. I'm 30."
I'm pretty sure I need to use groups and subgroups to replace what I want, but I don't know how to deal with groups when "()+" is present. The numbers can also be very long
(e.g. 123'456'789'123'456'789,123).
Thanks

You may use re.sub with
(?<=\d)['.](?=\d)
and replace with an empty string.
Details
(?<=\d) - (positive lookbehind) a digit must appear immediately to the left of the current location
['.] - a single quote or a dot
(?=\d) - (positive lookahead) a digit must appear immediately to the right of the current location.
Python:
re.sub(r"(?<=\d)['.](?=\d)", "", myString)
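A runnable version of the substitution above, as a minimal sketch. The names THOUSANDS_SEP and strip_thousands_separators are made up for illustration; the pattern is the one from the answer. It assumes, as in the question, that a dot between two digits is always a thousands separator, so a value such as 2.5 would become 25.

```python
import re

# Pattern from the answer: a ' or . with a digit directly on both sides.
THOUSANDS_SEP = re.compile(r"(?<=\d)['.](?=\d)")


def strip_thousands_separators(text):
    """Remove ' and . when they sit directly between two digits."""
    return THOUSANDS_SEP.sub("", text)


my_string = "I buy 456'123,45 then 45.654 then 123. I'm 30."
print(strip_thousands_separators(my_string))
# I buy 456123,45 then 45654 then 123. I'm 30.
print(strip_thousands_separators("123'456'789'123'456'789,123"))
# 123456789123456789,123
```

Because the lookarounds consume nothing, every separator is replaced on its own, so the length of the number does not matter.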

Related

Match a piece of text from the beginning up to the first occurrence of multicharacter substring

I want a regex search to end when it reaches ". ", but not when it reaches "."; I'm aware of using [^...] to exclude single characters, and have been using this to stop my search when it reaches a certain character. This does not work with strings though, as [^. ] stops when it reaches either character. Say I've got the code
import re
def main():
    my_string = "The value of the float is 2.5. The int's value is 2.\n"
    re.search("[^.]*", my_string)
main()
Which gives a match object with the string
"The value of the float is 2"
How can I change this so that it only stops after the string ". "?
Bonus question, is there any way to tell regex to stop whenever it reaches one of multiple strings? Using the above code as an example, if I wanted the search to end when it found the string ". " or the string ".\n", how would I go about it? Thanks!
To match from the start of a string up to the first . followed by whitespace, use
^(.*?)\.\s
If you only want to accept a space or a newline after the dot, use either of the following. The character class (second form) is best when every alternative is a single character; use alternation when some alternatives are multicharacter.
^(.*?)\.(?: |\n)
^(.*?)\.[ \n]
Details
^ - start of a string
(.*?) - Capturing group 1: any 0+ chars other than linebreak chars, as few as possible
\. - a literal . char
\s - a whitespace char
(?: |\n) / [ \n] - a non-capturing group with two alternatives, or the equivalent character class, each matching either a space or a newline.
Python demo:
import re
my_string = "The value of the float is 2.5. The int's value is 2.\n"
m = re.search(r"^(.*?)\.\s", my_string) # Try to find a match
if m: # If there is a match
    print(m.group(1)) # Show Group 1 value
NOTE: If the text before the match can contain line breaks, pass the re.S (re.DOTALL) flag:
m = re.search(r"^(.*?)\.\s", my_string, re.DOTALL)
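For the bonus question (stopping at any one of several strings), the alternation can be built from a list. This is only a sketch: the helper name text_before_first and its stops parameter are hypothetical, and re.escape is used so that characters such as "." are treated literally.

```python
import re


def text_before_first(text, stops):
    """Return the text before the earliest occurrence of any stop string.

    Returns None when no stop string occurs in the text.
    """
    # re.escape makes each stop string literal, so "." is not a wildcard.
    alternatives = "|".join(re.escape(stop) for stop in stops)
    # re.DOTALL lets the lazy group cross line breaks before a stop string.
    match = re.search(r"^(.*?)(?:" + alternatives + ")", text, re.DOTALL)
    return match.group(1) if match else None


my_string = "The value of the float is 2.5. The int's value is 2.\n"
print(text_before_first(my_string, [". ", ".\n"]))
# The value of the float is 2.5
```

Because the group is lazy, whichever stop string appears first in the text ends the match, regardless of its position in the list.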
Besides the classic approach explained by Wiktor, splitting may also be an interesting solution in this case.
>>> my_string
"The value of the float is 2.5. The int's value is 2.\n"
>>> re.split(r'\. |\.\n', my_string)
['The value of the float is 2.5', "The int's value is 2", '']
If you want to keep the period at the end of each sentence, you can do something like this:
['{}.'.format(sentence) for sentence in re.split(r'\. |\.\n', my_string) if sentence]
To handle multiple whitespace characters between the sentences:
>>> str2 = "The value of the float is 2.5. The int's value is 2.\n\n "
>>> ['{}.'.format(sentence)
...  for sentence in re.split(r'\. \s*|\.\n\s*', str2)
...  if sentence
... ]
['The value of the float is 2.5.', "The int's value is 2."]
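As a variation on the list comprehension above, a lookbehind can keep each period attached to its sentence, so nothing has to be added back afterwards. This is a sketch of a different technique from the one in the answer, and the name split_sentences is hypothetical.

```python
import re


def split_sentences(text):
    """Split after a period that is followed by whitespace, keeping the period."""
    # (?<=\.) asserts a period just before the whitespace without consuming it.
    parts = re.split(r"(?<=\.)\s+", text)
    # Trailing whitespace after the last period leaves an empty piece; drop it.
    return [part for part in parts if part]


str2 = "The value of the float is 2.5. The int's value is 2.\n\n "
print(split_sentences(str2))
# ['The value of the float is 2.5.', "The int's value is 2."]
```

The dot in 2.5 is not a split point because it is followed by a digit rather than whitespace.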

Python regex match only if standalone

Using re in python3, I want to match appearances of percentages in text, and substitute them with a special token (e.g. substitute "A 30% increase" by "A #percent# increase").
I only want to match if the percent expression is a standalone item. For example, it should not match "The product's code is A322%n43%". However, it should match when a line contains only one percentage expression like "89%".
I've tried using delimiters such as \b in my regex, but because % is itself a non-alphanumeric character, \b doesn't catch the end of the expression. Using \s makes it impossible to catch expressions standing by themselves on a line.
At the moment, I have the code:
>>> re.sub(r"[+-]?[.,;]?(\d+[.,;']?)+%", ' #percent# ', "1,211.21%")
' #percent# '
which still matches if the expression is followed by letters or other text (like the product code example above).
>>> re.sub(r"[+-]?[.,;]?(\d+[.,;']?)+%", ' #percent# ', "EEE1,211.21%asd")
'EEE #percent# asd'
What would you recommend?
Looks like a perfect job for Negative Lookbehind and Negative Lookahead:
re.sub(r'''(?<![^\s]) [+-]?[.,;]? (\d+[.,;']?)+% (?![^\s.,;!?'"])''',
'#percent#', string, flags=re.VERBOSE)
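(?<![^\s]) means "no non-whitespace character is allowed immediately before the current position", so the match must start at the beginning of the string or right after whitespace (add more characters to the class to also permit them before the match).
(?![^\s.,;!?'"]) means "the next character, if there is one, must be whitespace or one of . , ; ! ? ' "", so the match may also end at the end of the string.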
Demo: https://regex101.com/r/khV7MZ/1.
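A self-contained sketch of the lookaround approach, run against the examples from the question. The names PERCENT and mask_percentages are hypothetical, and the capture group has been made non-capturing, which does not change what re.sub replaces.

```python
import re

# Same pattern as in the answer, spelled out with comments. re.VERBOSE ignores
# unescaped whitespace and "#" comments outside character classes.
PERCENT = re.compile(
    r"""
    (?<![^\s])          # preceded by whitespace or the start of the string
    [+-]?[.,;]?         # optional sign and optional leading separator
    (?:\d+[.,;']?)+%    # digit groups with optional separators, then %
    (?![^\s.,;!?'"])    # followed by whitespace, punctuation or the end
    """,
    re.VERBOSE,
)


def mask_percentages(text):
    """Replace standalone percentage expressions with the #percent# token."""
    return PERCENT.sub("#percent#", text)


print(mask_percentages("A 30% increase"))
# A #percent# increase
print(mask_percentages("The product's code is A322%n43%"))
# The product's code is A322%n43%
print(mask_percentages("89%"))
# #percent#
```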
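Try wrapping the existing capture group and the % in a second, outer capture group followed by \b. Note that \b after % only matches when a word character follows, because % itself is a non-word character.
original: r"[+-]?[.,;]?(\d+[.,;']?)+%"
suggested: r"[+-]?[.,;]?((\d+[.,;']?)+%)\b"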

split string in python when characters on either side of separator are not numbers

I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But if I could split any string on ',' only when the characters on either side of it are non-numeric, I would be able to parse all entries like the second one without splitting up the chemicals in entries like the first, which have numbers in their names separated by commas (e.g. 2,4,5-TP).
Is there an easy pythonic way to do this?
Let me explain a little, based on @eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read a regex quick-start guide first if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex finds every , that satisfies either case, so a comma between two digits is left alone, which keeps names such as 2,4,5-TP intact.
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
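The explanation above can be packaged as a small Python 3 helper. This is a sketch only: SEPARATOR and split_chemicals are names invented here, and the pattern is the one discussed in this answer.

```python
import re

# Split on a comma that has a non-digit on at least one side,
# swallowing any whitespace next to it.
SEPARATOR = re.compile(r"(?<=\D),\s*|\s*,(?=\D)")


def split_chemicals(entry):
    """Split one entry into chemical names, dropping empty pieces."""
    return [name for name in SEPARATOR.split(entry) if name]


data_list = [
    "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP",
    "Lead,Paints/Pigments,Zinc",
]
for entry in data_list:
    print(split_chemicals(entry))
# ['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
# ['Lead', 'Paints/Pigments', 'Zinc']
```

One limitation of this pattern: a name ending in a digit directly followed by a name starting with a digit, as in "Vitamin B12,2,4-D", is not split, because that comma has a digit on both sides.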
Use a regex with lookbehind/lookahead assertions:
>>> import re
>>> s = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']

Extract text between double square brackets in Python

If I have a string that may look like this:
"[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
How do I extract the categories and put them into a list?
I'm having a hard time getting the regular expression to work.
To expand on the explanation of the regex used by Avinash in his answer:
Category:([^\[\]]*) consists of several parts:
Category: which matches the text "Category:"
(...) is a capture group meaning roughly "the expression inside this group is a block that I want to extract"
[^...] is a negated set which means "do not match any characters in this set".
\[ and \] match "[" and "]" in the text respectively.
* means "match zero or more of the preceding regex defined items"
Where I have used ... to indicate that I removed some characters that were not important for the explanation.
So putting it all together, the regex does this:
Finds "Category:" and then matches any number (including zero) of characters after it that are not the excluded characters "[" or "]". When it hits an excluded character it stops, and the text matched by the regex inside the (...) part is returned. So the regex does not actually look for "[[" or "]]" as you might expect, and it will match even if they are left out. You could force it to look for the double square brackets at the beginning and end by changing it to \[\[Category:([^\[\]]*)\]\].
For the second regex, Category:[^\[\]]*, the capture group (...) is excluded, so Python returns everything matched which includes "Category:".
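To make the difference between the loose pattern and the bracket-enforcing variant concrete, here is a small sketch. The constant names STRICT and LOOSE and the unbracketed sample string are invented for illustration.

```python
import re

text = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"

# Strict pattern: the double square brackets must be present.
STRICT = re.compile(r"\[\[Category:([^\[\]]*)\]\]")
# Loose pattern: only the "Category:" prefix is required.
LOOSE = re.compile(r"Category:([^\[\]]*)")

print(STRICT.findall(text))
# ['Political culture', 'Political ideologies']

unbracketed = "Category:Drafts and [[Category:Political culture]]"
print(LOOSE.findall(unbracketed))
# ['Drafts and ', 'Political culture']
print(STRICT.findall(unbracketed))
# ['Political culture']
```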
Seems like you want something like this,
>>> import re
>>> text = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
>>> re.findall(r'Category:([^\[\]]*)', text)
['Political culture', 'Political ideologies']
>>> re.findall(r'Category:[^\[\]]*', text)
['Category:Political culture', 'Category:Political ideologies']
By default, re.findall returns only the text matched by the capturing group when the pattern contains one. If the pattern has no capturing group, findall returns the whole matches in a list. In our case, Category: matches the literal text "Category:" and ([^\[\]]*) captures zero or more characters other than [ or ]. findall then returns the text captured by group 1.
Python code:
s = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
cats = [line.strip().strip("[").strip("]") for line in s.splitlines() if line]
print(cats)
Output:
['Category:Political culture', 'Category:Political ideologies']

Python regex, fetch names from a string

I have a string in the form of:
"[NUM : NAME : NUM]: [NUM : NAME : NUM]:..."
I want to be able to extract all the NAMEs out of this string. The NAME can have any character, ranging from alphabet to punctuation symbols and numbers. NUM is only in the form of [0-9]+
I tried issuing this command:
re.findall(r"\[[0-9]+\:([.]+)\:[0-9]+\]", string)
But instead of giving what I requested, it would bunch up a few [NUM : NAME : NUM]s into the [.]+ group, which is also correct in terms of this regex, but not what I need.
Any help would be much appreciated.
Try this:
re.findall(r"\[[0-9]+\:(.+?)\:[0-9]+\]", string)
Adding ? after the + makes the quantifier non-greedy. Greedy, which is the default, means that + takes as many characters as possible while still allowing a match. With the ?, + takes the minimum number of characters needed to match.
The above will work if there are no spaces between num, :, and name.
If there are spaces then use:
re.findall(r"\[[0-9]+ \: (.+?) \: [0-9]+\]", string)
First, you have enclosed . inside a character class, so it has lost its special meaning and matches only a literal dot (.).
Secondly, you are not considering the spaces after the numbers in your string.
Thirdly, you need to use a reluctant quantifier with the .+ in the center, so replace ([.]+) with (.+?).
Fourthly, you don't need to escape your colons (:).
You can try this:
>>> re.findall(r'\[[0-9]+[ ]*:(.+?):[ ]*[0-9]+\]', string)
[' NAME ', ' NAME ']
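A variant that tolerates optional whitespace on both sides of each colon and returns the names already trimmed. This is a sketch: NAME_PATTERN and the sample string are made up, and it assumes that a NAME never itself contains something that looks like ": NUM]".

```python
import re

# \s* tolerates any amount of whitespace around the colons, and the lazy
# (.+?) stops at the first ": NUM]" that completes a bracketed item.
NAME_PATTERN = re.compile(r"\[[0-9]+\s*:\s*(.+?)\s*:\s*[0-9]+\]")

string = "[1 : Alice : 2]: [3:Bob, Jr.:4]: [5 : a:b : 6]"
print(NAME_PATTERN.findall(string))
# ['Alice', 'Bob, Jr.', 'a:b']
```

Placing \s* outside the capture group is what keeps the surrounding spaces out of the results, in contrast to the ' NAME ' values above.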
