How to extract a part of a string

I have this string:
-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)
but I actually have many strings of this form:
a*p**(-1.0) + b*p**(c)
where a, b and c are doubles (floating-point numbers). I would like to extract a, b and c from each string. How can I do this using Python?

import re
s = '-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)'
pattern = r'-?\d+\.\d*'
a,_,b,c = re.findall(pattern,s)
print(a, b, c)
Output
-1007.88670550662 67293.8347365694 -0.416543501823503
Here s is the test string and pattern is the regex pattern, which looks for floats. findall() returns every match in order; the second match is the fixed exponent -1.0, so it is discarded via _ and the remaining matches are assigned to a, b and c.
Note that this method only works if your strings follow the format you've given; otherwise, adjust the pattern to match what you need.
Edit: as several people stated in the comments, if positive numbers may carry a leading +, you can use the pattern r'[-+]?\d+\.\d*' instead.
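As a minimal sketch of the signed pattern in use, the matches can be converted to floats in the same step. The helper name extract_coefficients is made up for illustration, and the sketch assumes every number contains a decimal point, as in the question:

```python
import re

# Optional sign, digits, a dot, optional fraction digits (the signed pattern above).
FLOAT_PATTERN = re.compile(r'[-+]?\d+\.\d*')

def extract_coefficients(expr):
    """Return (a, b, c) as floats from a string shaped like 'a*p**(-1.0) + b*p**(c)'."""
    # Exactly four numbers are expected; the second one is the fixed exponent.
    a, _fixed_exponent, b, c = (float(m) for m in FLOAT_PATTERN.findall(expr))
    return a, b, c

s = '-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)'
print(extract_coefficients(s))
```

If a string contains more or fewer than four numbers, the unpacking raises ValueError, which is a reasonably loud way to notice input that does not follow the expected format.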

Using the regular expression
(-?\d+\.?\d*)\*p\*\*\(-1\.0\)\s*\+\s*(-?\d+\.?\d*)\*p\*\*\((-?\d+\.?\d*)\)
We can do
import re
pat = r'(-?\d+\.?\d*)\*p\*\*\(-1\.0\)\s*\+\s*(-?\d+\.?\d*)\*p\*\*\((-?\d+\.?\d*)\)'
regex = re.compile(pat)
print(regex.findall('-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)'))
will print [('-1007.88670550662', '67293.8347365694', '-0.416543501823503')]
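A possible variation on the same idea uses named groups and fullmatch, so that strings not following the format are rejected outright. The names NUMBER, TERM_PATTERN and parse_terms are illustrative, and the optional leading + is an assumption not present in the pattern above:

```python
import re

NUMBER = r'[-+]?\d+\.?\d*'  # same number shape as above, plus an optional '+' sign
TERM_PATTERN = re.compile(
    rf'(?P<a>{NUMBER})\*p\*\*\(-1\.0\)\s*\+\s*'
    rf'(?P<b>{NUMBER})\*p\*\*\((?P<c>{NUMBER})\)'
)

def parse_terms(expr):
    """Return {'a': ..., 'b': ..., 'c': ...} as floats, or None if expr does not fit."""
    match = TERM_PATTERN.fullmatch(expr.strip())
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}

s = '-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)'
print(parse_terms(s))
```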

If your formats are consistent, and you don't want to deep dive into regex (check out regex101 for this, btw) you could just split your way through it.
Here's a start:
>>> s= "-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)"
>>> a, buf, c = s.split("*p**")
>>> b = buf.split()[-1]
>>> a,b,c
('-1007.88670550662', '67293.8347365694', '(-0.416543501823503)')
>>> [float(x.strip("()")) for x in (a,b,c)]
[-1007.88670550662, 67293.8347365694, -0.416543501823503]

The re module can certainly be made to work for this, although as some of the comments on the other answers have pointed out, the corner cases can be interesting -- decimal points, plus and minus signs, etc. It could be even more interesting; e.g. can one of your numbers be imaginary?
Anyway, if your string is always a valid Python expression, you can use Python's built-in tools to process it. Here is a good generic explanation about the ast module's NodeVisitor class. To use it for your example is quite simple:
import ast
x = "-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)"
def getnums(s):
    result = []
    class GetNums(ast.NodeVisitor):
        def visit_Num(self, node):
            result.append(node.n)
        def visit_UnaryOp(self, node):
            if (isinstance(node.op, ast.USub) and
                    isinstance(node.operand, ast.Num)):
                result.append(-node.operand.n)
            else:
                ast.NodeVisitor.generic_visit(self, node)
    GetNums().visit(ast.parse(s))
    return result
print(getnums(x))
This will return a list with all the numbers in your expression:
[-1007.88670550662, -1.0, 67293.8347365694, -0.416543501823503]
The visit_UnaryOp method is only required for Python 3.x.
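On recent Python versions, ast.Num and visit_Num are deprecated (since Python 3.8) in favour of ast.Constant and visit_Constant. The following is a sketch of the same visitor under that assumption; the names get_numbers, is_number and NumberVisitor are made up for illustration:

```python
import ast

def get_numbers(expr):
    """Collect numeric literals from a Python expression, folding unary minus."""
    numbers = []

    def is_number(node):
        # bool is a subclass of int, so exclude True/False explicitly.
        return (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float, complex))
                and not isinstance(node.value, bool))

    class NumberVisitor(ast.NodeVisitor):
        def visit_Constant(self, node):
            if is_number(node):
                numbers.append(node.value)

        def visit_UnaryOp(self, node):
            # A literal like -1.0 is parsed as USub applied to the constant 1.0.
            if isinstance(node.op, ast.USub) and is_number(node.operand):
                numbers.append(-node.operand.value)
            else:
                self.generic_visit(node)

    NumberVisitor().visit(ast.parse(expr, mode="eval"))
    return numbers

x = "-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)"
print(get_numbers(x))
```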

You can use something like:
import re
subject = '-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)'
a, _, b, c = re.findall(r"[\d\-.]+", subject)
print(a, b, c)

While I prefer MooingRawr's answer as it is simple, I would extend it a bit to cover more situations.
A floating point number can be converted to a string in a surprising variety of formats:
Exponential format (eg. 2.0e+07)
Without leading digit (eg. .5, which is equal to 0.5)
Without trailing digit (eg. 5., which is equal to 5)
Positive numbers with plus sign (eg. +5, which is equal to 5)
Numbers without decimal part (integers) (eg. 0 or 5)
Script
import re
test_values = [
    '-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)',
    '-2.000e+07*p**(-1.0) + 1.23e+07*p**(-5e+07)',
    '+2.*p**(-1.0) + -1.*p**(5)',
    '0*p**(-1.0) + .123*p**(7.89)'
]
pattern = r'([-+]?\.?\d+\.?\d*(?:[eE][-+]?\d+)?)'
for value in test_values:
    print("Test with '%s':" % value)
    matches = re.findall(pattern, value)
    del matches[1]
    print(matches, end='\n\n')
Output:
Test with '-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)':
['-1007.88670550662', '67293.8347365694', '-0.416543501823503']
Test with '-2.000e+07*p**(-1.0) + 1.23e+07*p**(-5e+07)':
['-2.000e+07', '1.23e+07', '-5e+07']
Test with '+2.*p**(-1.0) + -1.*p**(5)':
['+2.', '-1.', '5']
Test with '0*p**(-1.0) + .123*p**(7.89)':
['0', '.123', '7.89']

Related

Replacing sub-string occurrences with elements of a given list

Suppose I have a string that has the same sub-string repeated multiple times and I want to replace each occurrence with a different element from a list.
For example, consider this scenario:
pattern = "_____" # repeated pattern
s = "a(_____), b(_____), c(_____)"
r = [0,1,2] # elements to insert
The goal is to obtain a string of the form:
s = "a(_000_), b(_001_), c(_002_)"
The number of occurrences is known, and the list r has the same length as the number of occurrences (3 in the previous example) and contains increasing integers starting from 0.
I came up with this solution:
import re
pattern = "_____"
s = "a(_____), b(_____), c(_____)"
l = [m.start() for m in re.finditer(pattern, s)]
i = 0
for el in l:
    s = s[:el] + f"_{str(i).zfill(5 - 2)}_" + s[el + 5:]
    i += 1
print(s)
Output: a(_000_), b(_001_), c(_002_)
This solves my problem, but it seems to me a bit cumbersome, especially the for-loop. Is there a better way, maybe more "pythonic" (intended as concise, possibly elegant, whatever it means) to solve the task?

You can simply use the re.sub() method with count=1 in a loop, so that each pass replaces only the next occurrence of the pattern with the next element from the list.
import re
pattern = re.compile("_____")
s = "a(_____), b(_____), c(_____)"
r = [0,1,2]
for val in r:
    s = re.sub(pattern, f"_{val:03d}_", s, count=1)
print(s)
You can also skip re entirely and build the string from the values in the r list and their indexes:
r = [0,1,2]
s = ", ".join(f"{'abc'[i]}(_{val:03d}_)" for i, val in enumerate(r))
print(s)
Output: a(_000_), b(_001_), c(_002_)

TL;DR
Use re.sub with a replacement callable and an iterator:
import re
p = re.compile("_____")
s = "a(_____), b(_____), c(_____)"
r = [0, 1, 2]
it = iter(r)
print(re.sub(p, lambda _: f"_{next(it):03d}_", s))
Long version
Generally speaking, it is a good idea to re.compile your pattern once ahead of time. If you are going to use that pattern repeatedly later, this makes the regex calls much more efficient. There is basically no downside to compiling the pattern, so I would just make it a habit.
As for avoiding the for-loop altogether, the re.sub function allows us to pass a callable as the repl argument, which takes a re.Match object as its only argument and returns a string. Wouldn't it be nice, if we could have such a replacement function that takes the next element from our replacements list every time it is called?
Well, since you have an iterable of replacement elements, we can leverage the iterator protocol to avoid explicit looping over the elements. All we need to do is give our replacement function access to an iterator over those elements, so that it can grab a new one via the next function every time it is called.
The string format specification that Jamiu used in his answer is great if you know that the sub-string to be replaced will always be exactly five underscores (_____) and that your replacement numbers will never exceed 999.
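To make the three-digit limit concrete, here is a small sketch of how the 03d format specification behaves on either side of 999. The sample values are arbitrary:

```python
# "03d" zero-pads to a minimum width of three digits; it never truncates.
padded = [f"_{number:03d}_" for number in (7, 42, 999, 1000)]
print(padded)  # ['_007_', '_042_', '_999_', '_1000_']
```

The last entry is six characters long, so it would no longer fit into a five-underscore slot without changing the string's length.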
So in its simplest form, a function doing what you described, could look like this:
import re
from collections.abc import Iterable
def multi_replace(
    pattern: re.Pattern[str],
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)
    def repl(_match: re.Match[str]) -> str:
        return f"_{next(iterator):03d}_"
    return re.sub(pattern, repl, string)
Trying it out with your example data:
if __name__ == "__main__":
    p = re.compile("_____")
    s = "a(_____), b(_____), c(_____)"
    r = [0, 1, 2]
    print(multi_replace(p, r, s))
Output: a(_000_), b(_001_), c(_002_)
In this simple application, we aren't doing anything with the Match object in our replacement function.
If you want to make it a bit more flexible, there are a few avenues possible. Let's say the sub-strings to replace might (perhaps unexpectedly) be a different number of underscores. Let's further assume that the numbers might get bigger than 999.
First of all, the pattern would need to change a bit. And if we still want to center the replacement in an arbitrary number of underscores, we'll actually need to access the match object in our replacement function to check the number of underscores.
The format specifiers are still useful because they allow centering the inserted object with the ^ align code.
import re
from collections.abc import Iterable
def dynamic_replace(
    pattern: re.Pattern[str],
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)
    def repl(match: re.Match[str]) -> str:
        replacement = f"{next(iterator):03d}"
        length = len(match.group())
        return f"{replacement:_^{length}}"
    return re.sub(pattern, repl, string)
if __name__ == "__main__":
    p = re.compile("(_+)")
    s = "a(_____), b(_____), c(_____), d(_______), e(___)"
    r = [0, 1, 2, 30, 4000]
    print(dynamic_replace(p, r, s))
Output: a(_000_), b(_001_), c(_002_), d(__030__), e(4000)
Here we are building the replacement string based on the length of the match group (i.e. the number of underscores) to ensure the number is always centered.
I think you get the idea. As always, separation of concerns is a good idea. You can put the replacement logic in its own function and refer to that, whenever you need to adjust it.

I don't see regex as the best fit for this situation.
pattern = "_____" # repeated pattern
s = "a(_____), b(_____), c(_____)"
r = [0,1,2] # elements to insert
fstring = s.replace(pattern, "_{}_")
str_out = fstring.format(*r)
str_out_pad = fstring.format(*[str(entry).zfill(3) for entry in r])
print(str_out)
print(str_out_pad)
--
a(_0_), b(_1_), c(_2_)
a(_000_), b(_001_), c(_002_)

Python regex that repeats \d number of times

Using Python regex, I am trying to match as many p characters as the digit first matched in the pattern.
Sample Input
1pp
2p
3ppp
4ppppppppp
Expected Output
1p
None
3ppp
4pppp
Code Tried
I have tried the following code, where I use a named group, giving the name 'dig' to the matched digit, and then try to use dig as the repetition count {m}. But the following code does not find any match in pattern.
import re
pattern = "2pppp"
reTriple = r'((?P<dig>\d)p{(?P=dig)})'
regex = re.compile(reTriple,re.IGNORECASE)
matches = re.finditer(regex,pattern)
I think the problem is that the repetition {m} expects an int m, whereas dig is a string. But I can't find a way to concatenate an int to a string while keeping it an int! I tried casting as follows:
reTrip = '((?P<dig>\d)p{%d}'%int('(?P=dig)')+')'
But I get the following error:
ValueError: invalid literal for int() with base 10: '(?P=dig)'
I feel stuck. Can someone please guide me?
It's also weird that if I instead break reTriple up as follows (save the matched digit in a variable first, then concatenate this variable into reTriple), it works and the expected output is achieved. But this is a workaround, and I am looking for a better method.
reTriple = r'(?P<dig>\d)'
dig = re.search(reTriple , pattern).group('dig')
reTriple = reTriple + '(p{1,' + dig + '})'

What you are trying basically comes down to (\d+)p{\1}, where capture group 1 would serve as the repeat count for "p". However, a backreference matches the captured text itself; it cannot be used as a number inside a {m} quantifier, which is why you find no results.
It may help to split this into two operations. For example:
import re
def val_txt(txt):
    i = int(re.search(r'\d+', txt).group(0))
    fnd = re.compile(fr'(?i)\d+p{{{i}}}')
    match = fnd.search(txt)
    if match:
        return match.group(0)
print(val_txt('2p'))
# None
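The two-step idea can be sketched against all four sample inputs from the question. This is only an illustration: the helper name match_counted and its letter parameter are assumptions, and it anchors the match at the start of the string:

```python
import re

def match_counted(text, letter="p"):
    """Return the leading digits plus exactly that many letters, or None."""
    head = re.match(r"\d+", text)
    if head is None:
        return None
    count = int(head.group())
    # Second step: build a pattern with the concrete repeat count, e.g. \d+p{3}.
    full = re.match(rf"\d+{re.escape(letter)}{{{count}}}", text, re.IGNORECASE)
    return full.group() if full else None

samples = ["1pp", "2p", "3ppp", "4ppppppppp"]
results = [match_counted(sample) for sample in samples]
print(results)  # ['1p', None, '3ppp', '4pppp']
```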

You can also do pure string operations without depending on any module for the mentioned strings in the question (digits < 10):
def val_txt(txt):
    dig = int(txt[0])
    rest_val = 'p' * dig
    return f'{dig}{rest_val}' if txt[1:1+dig] == rest_val else None
print(val_txt('1ppp'))
# 1p

Hi, here is another approach without regex:
from typing import Union
def test(txt: str, var: str = 'p') -> Union[str, None]:
    var_count = txt.count(var)
    number = int(txt[0:len(txt) - var_count:])
    if number <= var_count:
        return f'{number}{number * var}'
    return None
Let's test it:
t = ['1pp', '2p', '3ppp', '4ppppppppp', '10pppppppppp']
for i in t:
    print(test(i))
Output:
1p
None
3ppp
4pppp
10pppppppppp

Here's a single step regex solution which uses a lambda function to check if there are sufficient p's to match the digits at the beginning of the string; if there are it returns the appropriate string (e.g. 1p or 3ppp), otherwise it returns an empty string:
import re
strs = ['1pp',
        '2p',
        '3ppp',
        '4ppppppppp'
        ]
for s in strs:
    print(re.sub(r'^(\d+)(p+).*', lambda m: m.group(1) + m.group(2)[:int(m.group(1))] if len(m.group(2)) >= int(m.group(1)) else '', s))
Output (the second line is empty because 2p does not have enough p's):
1p

3ppp
4pppp

How to convert engineering notation of numbers to scientific in equation using python

I have equation strings where all numbers are in engineering notation, like:
"(10u*myvar1)+(2.5f*myvar2)/myvar3"
I need to convert all numbers in these equation strings to scientific notation, so that the result looks like:
"(10e-6*myvar1)+(2.5e-15*myvar2)/myvar3"
Does anyone have an idea how to do this simply?
The hard way, I think, is to split the string with re.findall into numbers and everything else, then fix the numbers and rejoin everything into a string. Like:
vals=re.findall(r'[\d.\w]+',param_value) #all numbers
operators=re.findall(r'[^\d.\w]+',param_value) #all not numbers
And then work on these two lists. But it seems too complicated. I don't see a simple way to join these two lists back into a string.

You can do a simple regex substitution:
>>> import re
>>> units = 'munpf'  # milli, micro, nano, pico, femto
>>> def f(match):
...     num = match.group(0)
...     exp = -3 * (units.index(num[-1]) + 1)
...     return num[:-1] + 'e' + str(exp)
...
>>> expr = "(10u*myvar1)+(2.5f*myvar2)/myvar3"
>>> re.sub(r'\b\d+(\.\d*)?' + '[%s]' % units + r'\b', f, expr)
'(10e-6*myvar1)+(2.5e-15*myvar2)/myvar3'
It's easy to extend if you want.
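One way such an extension might look is a lookup table instead of a position-based string, which also allows positive exponents. This is a sketch only: the table SI_EXPONENTS, the extra k/M/G suffixes and the function name to_scientific are assumptions for illustration, not part of the answer above:

```python
import re

# Hypothetical suffix table; extend it with whatever prefixes your equations use.
SI_EXPONENTS = {"m": -3, "u": -6, "n": -9, "p": -12, "f": -15,
                "k": 3, "M": 6, "G": 9}

# A number (digits, optional fraction) immediately followed by one known suffix.
ENG_NUMBER = re.compile(
    r"\b(\d+(?:\.\d*)?)([%s])\b" % "".join(SI_EXPONENTS)
)

def to_scientific(expression):
    """Rewrite numbers such as 10u or 2.5f as 10e-6 or 2.5e-15."""
    return ENG_NUMBER.sub(
        lambda m: f"{m.group(1)}e{SI_EXPONENTS[m.group(2)]}", expression
    )

print(to_scientific("(10u*myvar1)+(2.5f*myvar2)/myvar3"))
# (10e-6*myvar1)+(2.5e-15*myvar2)/myvar3
```

The word boundaries keep identifiers such as myvar1 untouched, because a digit that directly follows a letter is not at a word boundary.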

python string manipulation and processing

I have a number of codes which I need to process, and these come through in a number of different formats which I need to manipulate first to get them in the right format:
Examples of codes:
ABC1.12 - correct format
ABC 1.22 - space between letters and numbers
ABC1.12/13 - 2 codes joined together and leading 1. missing from 13, should be ABC1.12 and ABC1.13
ABC 1.12 / 1.13 - codes joined together and spaces
I know how to remove the spaces but am not sure how to handle the codes which have been split. I know I can use the split function to create 2 codes but not sure how I can then append the letters (and first number part) to the second code. This is the 3rd and 4th example in the list above.
WHAT I HAVE SO FAR
val = # code
retList = [val]
if "/" in val:
    (code1, code2) = session_codes = val.split("/", 1)
    (initial_letters, numbers) = code1.split(".", 1)
    if initial_letters not in code2:
        code2 = initial_letters + '.' + code2
    # reset list so that it returns both values
    retList = [code1, code2]
This doesn't really handle the split in the 4th example, because code2 becomes ABC1.1.13

You can use regex for this purpose.
A possible implementation would be as follows:
>>> import re
>>> def foo(st):
...     parts=st.replace(' ','').split("/")
...     parts=list(re.findall("^([A-Za-z]+)(.*)$",parts[0])[0])+parts[1:]
...     parts=parts[0:1]+[x.split('.') for x in parts[1:]]
...     parts=parts[0:1]+['.'.join(x) if len(x) > 1 else '.'.join([parts[1][0],x[0]]) for x in parts[1:]]
...     return [parts[0]+p for p in parts[1:]]
...
>>> foo('ABC1.12')
['ABC1.12']
>>> foo('ABC 1.22')
['ABC1.22']
>>> foo('ABC1.12/13')
['ABC1.12', 'ABC1.13']
>>> foo('ABC 1.12 / 1.13')
['ABC1.12', 'ABC1.13']

Are you familiar with regex? That would be an angle worth exploring here. Also, consider splitting on the space character, not just the slash and decimal.

I suggest you write a regular expression for each code pattern and then form a larger regular expression which is the union of the individual ones.
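As a rough sketch of the regex route suggested in these comments, a single pattern can cover all four example formats from the question. The names CODE_RE and expand_codes are hypothetical, and the sketch assumes every code is letters followed by major.minor, with optional /-separated extras:

```python
import re

# letters, optional spaces, major.minor, then any number of "/ [major.]minor" extras
CODE_RE = re.compile(
    r"^\s*([A-Za-z]+)\s*(\d+)\.(\d+)((?:\s*/\s*(?:\d+\.)?\d+)*)\s*$"
)

def expand_codes(raw):
    """Return the list of normalized codes contained in one raw entry."""
    match = CODE_RE.match(raw)
    if match is None:
        raise ValueError(f"unrecognized code format: {raw!r}")
    letters, major, minor, rest = match.groups()
    codes = [f"{letters}{major}.{minor}"]
    for part in re.findall(r"/\s*((?:\d+\.)?\d+)", rest):
        if "." not in part:
            # The leading "major." was omitted, so reuse it from the first code.
            part = f"{major}.{part}"
        codes.append(letters + part)
    return codes

for raw in ("ABC1.12", "ABC 1.22", "ABC1.12/13", "ABC 1.12 / 1.13"):
    print(raw, "->", expand_codes(raw))
```

Raising ValueError on anything unrecognized is a design choice for the sketch; returning the input unchanged would be an equally reasonable alternative.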

Using PyParsing
The answer by @Abhijit is good, and for this simple problem regex may be the way to go. However, when dealing with parsing problems, you'll often need a more extensible solution that can grow with your problem. I've found that pyparsing is great for that: you write the grammar and it does the parsing:
from pyparsing import *
index = Combine(Word(alphas))
# Define what a number is and convert it to a float
number = Combine(Word(nums)+Optional('.'+Optional(Word(nums))))
number.setParseAction(lambda x: float(x[0]))
# What do extra numbers look like?
marker = Word('/').suppress()
extra_numbers = marker + number
# Define what a possible line could be
line_code = Group(index + number + ZeroOrMore(extra_numbers))
grammar = OneOrMore(line_code)
From this definition we can parse the string:
S = '''ABC1.12
ABC 1.22
XXX1.12/13/77/32.
XYZ 1.12 / 1.13
'''
print(grammar.parseString(S))
Giving:
[['ABC', 1.12], ['ABC', 1.22], ['XXX', 1.12, 13.0, 77.0, 32.0], ['XYZ', 1.12, 1.13]]
Advantages:
The numbers are now in the correct format, as we've cast them to floats during parsing. Many more "numbers" are handled: look at the index "XXX", where numbers of the form 1.12, 13 and 32. are all parsed, regardless of the decimal point.

Take a look at this method. It might be a simple way to do it.
val = input()
# find the index of the first numeric character
for aChar in val:
    if aChar.isnumeric():
        lastIndex = val.index(aChar)
        break
part1 = val[:lastIndex].strip()
part2 = val[lastIndex:]
if "/" not in part2:
    print(part1 + part2)
else:
    if " " not in part2:
        codes = []
        divPart2 = part2.split(".")
        partCodes = divPart2[1].split("/")
        for aPart in partCodes:
            codes.append(part1 + divPart2[0] + "." + aPart)
        print(codes)
    else:
        codes = []
        divPart2 = part2.split("/")
        for aPart in divPart2:
            aPart = aPart.strip()
            codes.append(part1 + aPart)
        print(codes)

Add 'decimal-mark' thousands separators to a number

How do I format 1000000 as 1.000.000 in Python, where '.' is the decimal-mark thousands separator?

If you want to add a thousands separator, you can write:
>>> '{0:,}'.format(1000000)
'1,000,000'
But it only works in Python 2.7 and above.
See format string syntax.
In older versions, you can use locale.format():
>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
'en_AU.utf8'
>>> locale.format('%d', 1000000, 1)
'1,000,000'
The added benefit of using locale.format() is that it will use your locale's thousands separator, e.g.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'de_DE.utf-8')
'de_DE.utf-8'
>>> locale.format('%d', 1000000, 1)
'1.000.000'

I didn't really understand it; but here is what I understand:
You want to convert 1123000 to 1,123,000. You can do that by using format:
http://docs.python.org/release/3.1.3/whatsnew/3.1.html#pep-378-format-specifier-for-thousands-separator
Example:
>>> format(1123000,',d')
'1,123,000'

Just extending the answer a bit here :)
I needed both a thousands separator and limited precision for a floating point number.
This can be achieved by using the following format string:
>>> my_float = 123456789.123456789
>>> "{:0,.2f}".format(my_float)
'123,456,789.12'
This describes the format()-specifier's mini-language:
[[fill]align][sign][#][0][width][,][.precision][type]
Source: https://www.python.org/dev/peps/pep-0378/#current-version-of-the-mini-language
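Since the mini-language only offers ',' (and '_') as grouping characters, one possible way to get '.' as the thousands separator together with a ',' decimal mark is to format first and then swap the two characters. This is a sketch; the function name format_european is made up for illustration:

```python
def format_european(value, decimals=2):
    """Format with '.' as thousands separator and ',' as decimal mark."""
    us_style = f"{value:,.{decimals}f}"  # e.g. '1,234,567.89'
    # Swap the two separator characters in a single pass.
    return us_style.translate(str.maketrans(",.", ".,"))

print(format_european(1234567.891))          # 1.234.567,89
print(format_european(1000000, decimals=0))  # 1.000.000
```

Using translate swaps both characters at once, which avoids the temporary placeholder that chained replace() calls would need.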

An idea
def itanum(x):
    return format(x,',d').replace(",",".")
>>> itanum(1000)
'1.000'

Strange that nobody mentioned a straightforward solution with regex:
import re
print(re.sub(r'(?<!^)(?=(\d{3})+$)', r'.', "12345673456456456"))
Gives the following output:
12.345.673.456.456.456
It also works if you want to separate the digits only before comma:
re.sub(r'(?<!^)(?=(\d{3})+,)', r'.', "123456734,56456456")
gives:
123.456.734,56456456
the regex uses lookahead to check that the number of digits after a given position is divisible by 3.
Update 2021: Please use this for scripting only (i.e. only in situation where you can destroy the code after using it). When used in an application, this approach would constitute a ReDoS.

Using itertools can give you some more flexibility:
>>> from itertools import zip_longest
>>> num = "1000000"
>>> sep = "."
>>> places = 3
>>> args = [iter(num[::-1])] * places
>>> sep.join("".join(x) for x in zip_longest(*args, fillvalue=""))[::-1]
'1.000.000'

Drawing on the answer by Mikel, I implemented his solution like this in my matplotlib plot. I figured some might find it helpful:
ax=plt.gca()
ax.get_xaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, loc: locale.format('%d', x, 1)))

DIY solution
def format_number(n):
    result = ""
    for i, digit in enumerate(reversed(str(n))):
        if i != 0 and (i % 3) == 0:
            result += ","
        result += digit
    return result[::-1]
built-in solution
def format_number(n):
    return "{:,}".format(n)

Here's just an alternative answer.
You can build the result with plain string indexing and some slightly odd logic.
Here's the code:
i = 1234567890
s = str(i)
str1 = ""
s1 = [elm for elm in s]
if len(s1) % 3 == 0:
    for i in range(0, len(s1) - 3, 3):
        str1 += s1[i] + s1[i+1] + s1[i+2] + "."
    # append the final group of three digits
    str1 += s1[-3] + s1[-2] + s1[-1]
else:
    rem = len(s1) % 3
    for i in range(rem):
        str1 += s1[i]
    for i in range(rem, len(s1) - 1, 3):
        str1 += "." + s1[i] + s1[i+1] + s1[i+2]
print(str1)
Output
1.234.567.890
