Python replace middle digits with commas thousand separator - python

I have a string like this:
123456789.123456789-123456789
Before and after the decimal/hyphen there can be any number of digits, what I need to do is remove everything before the decimal including the decimal and remove the hyphen and everything after the hyphen. Then with the middle group of digits (that I need to keep) I need to place a comma thousands separators. 
So here the output would be: 
123,456,789
I can use lookarounds to capture the digits in the middle but then it wont replace the other digits and i'm not sure how to place commas using lookarounds. 
(?<=\.)\d+(?=-)
Then I figured I could use a capturing group like so which will work, but not sure how to insert the comma's
\d+\.(\d+)-\d+
How could I insert comma's using one of the above regex?

Don't try to insert the thousands separators with a regex; just pick out that middle number and use a function to produce the replacement; re.sub() accepts a function as replacement pattern:
re.sub(r'\d+\.(\d+)-\d+', lambda m: format(int(m.group(1)), ','), inputtext)
The , format for integers when used in the format() function handles formatting a number to one with thousands separators:
>>> import re
>>> inputtext = '123456789.123456789-123456789'
>>> re.sub(r'\d+\.(\d+)-\d+', lambda m: format(int(m.group(1)), ','), inputtext)
'123,456,789'
This will of course still work in a larger body of text containing the number, dot, number, dash, number sequence.
The format() function is closely related to the str.format() method but doesn't require a full string template (so no {} placeholder or field names required).

You've asked for a full regular expression here, It would probably be easier to split your string..
>>> import re
>>> s = '123456789.123456789-123456789'
>>> '{:,}'.format(int(re.split('[.-]', s)[1]))
123,456,789
If you prefer using regular expression, use a function call or lambda in the replacement:
>>> import re
>>> s = '123456789.123456789-123456789'
>>> re.sub(r'\d+\.(\d+)-\d+', lambda m: '{:,}'.format(int(m.group(1))), s)
123,456,789
You can take a look at the different format specifications.

Related

Extract Only Digits from Dollar Figures

What I'm trying to do is extract only the digits from dollar figures.
Format of Input
...
$1,289,868
$62,000
$421
...
Desired Output
...
1289868
62000
421
...
The regular expression that I was using to extract only the digits and commas is:
r'\d+(,\d+){0,}'
which of course outputs...
...
1,289,868
62,000
421
...
What I'd like to do is convert the output to an integer (int(...)), but obviously this won't work with the commas. I'm sure I could figure this out on my own, but I'm running really short on time right now.
I know I can simply use r'\d+', but this obviously separates each chunk into separate matches...
You can't match discontinuous texts within one match operation. You can't put a regex into re.findall against 1,345,456 to receive 1345456. You will need to first match the strings you need, and then post-process them within code.
A regex you may use to extract the numbers themselves
re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)
See this regex demo.
Alternatively, you may use a bit more general regex to be used with re.findall:
r'\$(\d+(?:,\d+)*)'
See this regex demo.
Note that re.findall will only return the captured part of the string (the one matched with the (...) part in the regex).
Details
\$ - a dollar sign
(\d{1,3}(?:,\d{3})*) - Capturing group 1:
\d{1,3} - 1 to 3 digits (if \d+ is used, 1 or more digits)
(?:,\d{3})* - 0 or more sequences of
, - a comma
\d{3} - 3 digits (or if \d+ is used, 1 or more digits).
Python code sample (with removing commas):
import re
s = """$1,289,868
$62,000
$421"""
result = [x.replace(",", "") for x in re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)]
print(result) # => ['1289868', '62000', '421']
Using re.sub
Ex:
import re
s = """$1,289,868
$62,000
$421"""
print([int(i) for i in re.sub(r'[^0-9\s]', "", s).splitlines()])
Output:
[1289868, 62000, 421]
You don't need regex for this.
int(''.join(filter(str.isdigit, "$1,000,000")))
works just fine.
If you did want to use regex for some reason:
int(''.join(re.findall(r"\d", "$1,000,000")))
If you know how to extract the numbers with comma groupings, the easiest thing to do is just transform that into something int can handle:
for match in matches:
i = int(match.replace(',', ''))
For example, if match is '1,289,868', then match.replace(',', '') is '1289868', and obviously int(<that>) is 1289868.
You dont need regex for this. Just string operations should be enough
>>> string = '$1,289,868\n$62,000\n$421'
>>> [w.lstrip('$').replace(',', '') for w in string.splitlines()]
['1289868', '62000', '421']
Or alternatively, you can use locale.atoi to convert string of digits with commas to int
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF8')
>>> list(map(lambda x: locale.atoi(x.lstrip('$')), string.splitlines()))
[1289868, 62000, 421]

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print re.sub(x, '_', s)
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since the ^ can only be used inside brackets [], this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no "not" in REs except ^ inside [ ] which only matches one character (as you found). Here is my solution:
import re
def subit(m):
stuff, word = m.groups()
return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print re.sub(r'(.+?)(going|you|$)', subit, s)
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that when you are dealing with words and you want to exclude them or etc. you need to remember that most of the regex engines (most of them use traditional NFA) analyze the strings by characters. And here since you want to exclude two word and want to use a negative lookahead you need to define the allowed strings as words (using word boundary) and since in sub it replaces the matched patterns with it's replace string you can't just pass the _ because in that case it will replace a part like I am with 3 underscore (I, ' ', 'am' ). So you can use a function to pass as the second argument of sub and multiply the _ with length of matched string to be replace.

Python re.sub not returning expected string of int

I would like to left pad zeros to the number in a string. For example, the string
hello120_c
padded to 5 digits should become
hello00120_c
I would like to use re.sub to make the replacement. Here is my code:
>>> re.sub('(\d+)', r'\1'.zfill(5), 'hello120_c')
which returns
>>> 'hello000120_c'
which has 6 digits rather than 5. Checking '120'.zfill(5) alone gives '00120'. Also, re.findall appears to confirm the regular expression is matching the full '120'.
What is causing re.sub to act differently?
You cannot use the backreference directly. Use a lamda:
re.sub(r'\d+', lambda x: x.group(0).zfill(5), 'hello120_c')
# => hello00120_c
Also, note that you do not need a capturing group since you can access the matched value via .group(0). Also, note the r'...' (raw string literal) used to declare the regex.
See IDEONE demo:
import re
res = re.sub(r'\d+', lambda x: x.group(0).zfill(5), 'hello120_c')
print(res)

How to replace repeated pattern of characters

I have a string that has pairs of random characters repeating 3 times within it, for ex "abababwhatevercdcdcd" and i want to remove these pairs to get the rest of the string, like "whatever" in the former example, how do i do that?
I tried the following:
import re
re.sub(r'([a-z0-9]{2}){3}', r'', string)
but it does not work
You need backreferences here in order to repeat the match that was actually made, as opposed to trying to make a new match with the same pattern:
([a-z0-9]{2})\1\1
>>> import re
>>> re.sub(r'([a-z0-9]{2})\1\1', r'', "abababwhatevercdcdcd")
'whatever'
>>> re.sub(r'([a-z0-9]{2})\1\1', r'', "wabababhatevercdcdcd")
'whatever'
For more than one character, you can use :
(.{2,})\1+

How to make re.split() inclusive

I have the following:
>>> x='STARSHIP_TROOPERS_INVASION_2012_LOCDE'
>>> re.split('_\d{4}',x)[0]
'STARSHIP_TROOPERS_INVASION'
How would I get the year included? For example:
STARSHIP_TROOPERS_INVASION_2012
Note there are tens of thousands of titles, and I need to split on the year for each. I can't do a normal python split() here.
A more straightforward solution would be using re.search()/MatchObject.end():
m = re.search('_\d{4}', x)
print x[:m.end(0)]
If you want to stick with split(), you can use a lookbehind:
re.split('(?<=_\d{4}).', x)
(This work even when the year is at the end of the string, because split() returns an array with the original string in case the delimiter isn't found.)
If its always going to be the same pattern, then why not:
>>> x = 'STARSHIP_TROOPERS_INVASION_2012_LOCDE'
>>> x[:x.rfind('_')]
'STARSHIP_TROOPERS_INVASION_2012'
For your original regular expression, since you aren't capturing the matched group, it is not part of your matches:
>>> re.split('_\d{4}',x)
['STARSHIP_TROOPERS_INVASION', '_LOCDE']
>>> re.split('_(\d{4})',x)
['STARSHIP_TROOPERS_INVASION', '2012', '_LOCDE']
The () marks the selection as a captured group:
Matches whatever regular expression is inside the parentheses, and
indicates the start and end of a group; the contents of a group can be
retrieved after a match has been performed, and can be matched later
in the string with the \number special sequence, described below. To
match the literals '(' or ')', use ( or ), or enclose them inside a
character class: [(] [)].
you may use both split() and search() supposing you have a single such date in your string you wish to split at.
import re
x='STARSHIP_TROOPERS_INVASION_2012_LOCDE'
date=re.search('_\d{4}',x).group(0)
print(date)
gives
>>>
_2012
and
print(re.split('_\d{4}',x)[0]+date)
gives
STARSHIP_TROOPERS_INVASION_2012

Categories