How to get all characters up to a number? - python

I have strings like below
>>> s1
'this_is-a.string-123-with.number'
>>> s2
'this_is-a123.456string-123-with.number'
>>> s3
'one-0more-str.999'
I need to get everything before the first all-digit token (all digits, not alphanumeric) after splitting: this_is-a.string- from s1, this_is-a123.456string- from s2, and one-0more-str. from s3.
>>> for a in re.split(r'-|_|\.', s2):
...     if a.isdigit():
...         r = re.split(a, s2)[0]
...         break
...
>>> print(r)
# expected: this_is-a123.456string-
# got: this_is-a
The code above works for s1 but not for s2, because 123 also matches inside a123 in s2. Is there a better, more Pythonic way?
More info:
With s3, splitting on -, _ or . as delimiters yields 999 as the only all-digit token, so everything before it, one-0more-str., should be printed. With s2, the all-digit token (isdigit) after splitting is 123, so the result is everything before it: this_is-a123.456string-. Likewise, if the input string is this_1s-a-4.test, the output should be this_1s-a-, because 4 is the all-digit token after splitting.

This will work for your example cases:
import re
def fn(s):
    return re.match(r"(.*?[-_.]|^)\d+([-_.]|$)", s).group(1)
(^ and $ match the beginning and end of the string respectively and the ? in .*? does a non-greedy match.)
Some more cases:
>>> fn("111")
""
>>> fn(".111")
"."
>>> fn(".1.11")
"."
You might also want to think about what you want to get if there is no group of all numbers:
>>> fn("foobar")
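With no all-digit token, re.match returns None, so calling .group(1) on the result raises AttributeError. A minimal sketch of a variant that returns None instead, assuming that is the behaviour you want; the name fn_or_none is made up for illustration:

```python
import re

# Same pattern as in fn; fn_or_none is a hypothetical name.
PATTERN = re.compile(r"(.*?[-_.]|^)\d+([-_.]|$)")

def fn_or_none(s):
    """Return the prefix before the first all-digit token, or None if there is none."""
    m = PATTERN.match(s)
    return m.group(1) if m else None

print(fn_or_none("one-0more-str.999"))
print(fn_or_none("foobar"))
```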

Not sure it will work in all cases, but you can try:
for a in reversed(re.split(r'-|_|\.', s2)):
    if a.isdigit():
        r = s2.rsplit(a, 1)[0]
        break
print(r)

This works for your examples
Code
import re
def parse(s):
    """Split on runs of digits, then keep everything
    before the last run of digits."""
    return ''.join(re.split(r'(\d+)', s)[:-2])
Tests
Using specified strings
for t in ['this_is-a.string-123-with.number',
          'this_is-a123.456string-123-with.number',
          'one-0more-str.999']:
    print(f'{parse(t)}')
Output
this_is-a.string-
this_is-a123.456string-
one-0more-str.
Explanation
String
s = 'this_is-a123.456string-123-with.number'
Split on group of digits
re.split(r'(\d+)', s)
Out: ['this_is-a', '123', '.', '456', 'string-', '123', '-with.number']
Leave out last two items in split
re.split(r'(\d+)', s)[:-2] # [:-2] slice dropping last two items of list
Out: ['this_is-a', '123', '.', '456', 'string-']
Join list into string
''.join(re.split(r'(\d+)', s)[:-2]) # join items
Out: this_is-a123.456string-
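Note that this keeps everything before the last run of digits, which coincides with the all-digit token in the three examples. A sketch of an alternative that follows the question's wording more literally (walk the delimiter-separated tokens and stop at the first one that is all digits); the function name prefix_before_number is hypothetical:

```python
import re

def prefix_before_number(s):
    """Return everything before the first all-digit token, or None if there is none."""
    # Tokens are maximal runs of characters other than '-', '_' and '.'
    for m in re.finditer(r'[^-_.]+', s):
        if m.group().isdigit():
            return s[:m.start()]
    return None

for t in ['this_is-a.string-123-with.number',
          'this_is-a123.456string-123-with.number',
          'one-0more-str.999',
          'this_1s-a-4.test']:
    print(prefix_before_number(t))
```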

If I understood correctly what you want, you can use a single regular expression to get the values you need:
import re
s1='this_is-a.string-123-with.number'
s2='this_is-a123.456string-123-with.number'
s3='one-0more-str.999'
# matches any group that is in between "all numbers"...
regex = re.compile(r'(.*[-\._])\d+([-\._].*)?')
m = regex.match(s1)
print(m.groups())
m = regex.match(s2)
print(m.groups())
m = regex.match(s3)
print(m.groups())
when you run this the result is the following:
('this_is-a.string-', '-with.number')
('this_is-a123.456string-', '-with.number')
('one-0more-str.', None)
If you are interested only in the first group you can use only:
>>> print(m.group(1))
one-0more-str.
If you want to filter for the cases where there is no second group:
>>> print([i for i in m.groups() if i])
['one-0more-str.']

Related

Substitute specific matches using regex

I want to execute substitutions using regex, not for all matches but only for specific ones. However, re.sub substitutes for all matches. How can I do this?
Here is an example.
Say, I have a string with the following content:
FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3
What I want to do is this:
re.sub(r'^BAR', '#BAR', s, index=[1,2], flags=re.MULTILINE)
to get the below result.
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
You could pass a replacement function to re.sub that keeps track of the match count and checks whether the current index should be substituted:
import re
s = '''FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3'''
i = 0
index = {1, 2}
def repl(x):
    global i
    if i in index:
        res = '#' + x.group(0)
    else:
        res = x.group(0)
    i += 1
    return res
print(re.sub(r'^BAR', repl, s, flags=re.MULTILINE))
Output:
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
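If the global counter is a concern (it has to be reset before every call), one possible variation keeps the counter in a closure. This is only a sketch; the helper name sub_at and its parameters are made up for illustration:

```python
import re
from itertools import count

def sub_at(pattern, prefix, text, indexes, flags=0):
    """Prefix only the matches whose zero-based position is in indexes."""
    counter = count()  # fresh counter for every call, so no global state

    def repl(m):
        # next(counter) yields 0, 1, 2, ... once per match
        return prefix + m.group(0) if next(counter) in indexes else m.group(0)

    return re.sub(pattern, repl, text, flags=flags)

s = "FOO=foo1\nBAR=bar1\nFOO=foo2\nBAR=bar2\nBAR=bar3"
print(sub_at(r'^BAR', '#', s, {1, 2}, flags=re.MULTILINE))
```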
You could:
1. Split your string using s.splitlines()
2. Iterate over the individual lines in a for loop
3. Track how many matches you have found so far
4. Only perform substitutions on those matches in the numerical ranges you want (e.g. matches 1 and 2)
5. Join the lines back into a single string (if need be).
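The steps above could be sketched roughly as follows. The function name comment_out_matches and its parameters are assumptions for illustration, and match numbers are taken to be zero-based:

```python
import re

def comment_out_matches(text, pattern, indexes, prefix='#'):
    """Prefix the matching lines whose zero-based match number is in indexes."""
    out = []
    seen = 0  # how many matching lines have been found so far
    for line in text.splitlines():
        if re.match(pattern, line):
            if seen in indexes:
                line = prefix + line
            seen += 1
        out.append(line)
    return '\n'.join(out)

s = "FOO=foo1\nBAR=bar1\nFOO=foo2\nBAR=bar2\nBAR=bar3"
print(comment_out_matches(s, r'BAR', {1, 2}))
```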

Is there a way in Python to create a tuple of the components matched in a regular expression?

Is there a way in Python to create a tuple of the components matched in a regular expression?
For instance, this is what I am trying to do.
import re
pattern = '^[A-Z]{5} [0-9]{6}(C|P)[0-9]{1,3}$'
str = 'ABCDE 020816C110'
m = re.match(pattern,str)
print(m.group())
ABCDE 020816C110
I want to make something that looks like ('ABCDE','020816','C','110') (based upon the parts within the regex)
and if my pattern is different, say,
pattern = '^[A-Z]{1,4} [A-Z]{2} [A-Z]$'
str = 'ABC FH P'
I would eventually get ('ABC','FH','P')
It seems I have to split on components of the regex that will be different by pattern.
I am considering making n number of separate calls to re.search with only the component pattern, but I doubt I will always find the appropriate substring or it will return more than I want.
Use capturing groups:
>>> pattern = '^([A-Z]{5}) ([0-9]{6})(C|P)([0-9]{1,3})$'
>>> m = re.match(pattern, str)
>>> m.groups()
('ABCDE', '020816', 'C', '110')
Try:
>>> import re
>>> pattern = '^([A-Z]{5}) ([0-9]{6})(C|P)([0-9]{1,3})$'
>>> s = 'ABCDE 020816C110'
>>> m = re.match(pattern, s)
>>> m.groups()
('ABCDE', '020816', 'C', '110')
You can use groups together with match; you only have to add ( and ) in the correct places.
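The same idea applied to the second pattern from the question, as a small sketch. Where the parentheses go is my reading of the components the question wants; the empty-tuple fallback for a non-match is an assumption:

```python
import re

# One capturing group per component of the second pattern in the question
pattern = r'^([A-Z]{1,4}) ([A-Z]{2}) ([A-Z])$'
s = 'ABC FH P'

m = re.match(pattern, s)
parts = m.groups() if m else ()  # empty tuple when the string does not match
print(parts)
```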

Need help extracting data from a file

I'm a newbie at python.
So my file has lines that look like this:
-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
I need help coming up with the correct python code to extract every float preceded by a colon and followed by a space (ex: [-0.294118, 0.487437,etc...])
I've tried dataList = re.findall(':(.*) ', str(line)) and dataList = re.split(':(.*) ', str(line)), but these return the whole line. I've been researching this problem for a while now, so any help would be appreciated. Thanks!
Try this one:
:(-?\d\.\d+)\s
In your code that will be
p = re.compile(r':(-?\d\.\d+)\s')
# findall scans the whole line; match would only try at the very start
dataList = p.findall(line)
This is more specific about what you want.
In your case .* is greedy and will match everything it can.
Test on Regexr.com:
In this case the last element wasn't captured because it has no space after it; if this is a problem, just remove the \s from the regex.
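As a possible alternative to dropping the \s, a lookahead can accept either whitespace or the end of the line. The sketch below also makes the fractional part optional so that 5:-1 is picked up; both changes are my assumptions about what is wanted, not part of the pattern above:

```python
import re

line = ("-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 "
        "5:-1 6:0.00149028 7:-0.53117 8:-0.0333333")

# Colon, then a number with an optional fractional part,
# followed by whitespace or the end of the string (not consumed)
pattern = re.compile(r':(-?\d+(?:\.\d+)?)(?=\s|$)')

data = [float(x) for x in pattern.findall(line)]
print(data)
```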
This will do it:
import re
line = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"
for match in re.finditer(r"(-?\d\.\d+)", line, re.DOTALL | re.MULTILINE):
    print(match.group(1))
Or:
match = re.search(r"(-?\d\.\d+)", line, re.DOTALL | re.MULTILINE)
if match:
    datalist = match.group(1)
else:
    datalist = ""
Output:
-0.294118
0.487437
0.180328
-0.292929
0.00149028
-0.53117
-0.0333333
Live Python Example:
http://ideone.com/DpiOBq
Regex Demo:
https://regex101.com/r/nR4wK9/3
Regex Explanation
(-?\d\.\d+)
Match the regex below and capture its match into backreference number 1 «(-?\d\.\d+)»
Match the character “-” literally «-?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match a single character that is a “digit” (ASCII 0–9 only) «\d»
Match the character “.” literally «\.»
Match a single character that is a “digit” (ASCII 0–9 only) «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Given:
>>> s='-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333.333'
With your particular data example, you can just grab the parts that would be part of a float with a regex:
>>> re.findall(r':([\d.-]+)', s)
['-0.294118', '0.487437', '0.180328', '-0.292929', '-1', '0.00149028', '-0.53117', '-0.0333.333']
You can also split and partition, which would be substantially faster:
>>> [e.partition(':')[2] for e in s.split() if ':' in e]
['-0.294118', '0.487437', '0.180328', '-0.292929', '-1', '0.00149028', '-0.53117', '-0.0333.333']
Then you can convert those to floats using try/except, map and filter; values that are not valid floats (such as the malformed last item) are dropped:
>>> def conv(s):
...     try:
...         return float(s)
...     except ValueError:
...         return None
...
>>> list(filter(lambda x: x is not None, map(conv, [e.partition(':')[2] for e in s.split() if ':' in e])))
[-0.294118, 0.487437, 0.180328, -0.292929, -1.0, 0.00149028, -0.53117]
A simple one-liner using a list comprehension:
line = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"
[float(s.split()[0]) for s in line.split(':')[1:]]
Note: this is the simplest to understand (and probably the fastest), since no regex evaluation is involved. But it only works for the particular format above; for example, extracting the second number from a less regularly formatted string would need more work than a one-liner.

Python regex to find only second quotes of paired quotes

I am wondering if there is some way to find only the second quote of each pair in a string that has paired quotes.
So if I have a string like '"aaaaa"' or just '""', I want to find only the last '"' in it. If I have '"aaaa""aaaaa"aaaa""', I want only the second, fourth and sixth '"'s. But if I have something like '"aaaaaaaa' or 'aaa"aaa', I don't want to find anything, since there are no paired quotes. If I have '"aaa"aaa"', I want to find only the second '"', since the third '"' has no pair.
I've tried to implement lookbehind, but it doesn't work with quantifiers, so my bad attempt was '(?<=\"a*)\"'.
You don't really need regex for this. You can do:
[i for i, c in enumerate(s) if c == '"'][1::2]
To get the index of every other '"'. Example usage:
>>> for s in ['"aaaaa"', '"aaaa""aaaaa"aaaa""', 'aaa"aaa', '"aaa"aaa"']:
...     print(s, [i for i, c in enumerate(s) if c == '"'][1::2])
...
"aaaaa" [6]
"aaaa""aaaaa"aaaa"" [5, 12, 18]
aaa"aaa []
"aaa"aaa" [4]
import re
reg = re.compile(r'(?:\").*?(\")')
then
for match in reg.findall('"this is", "my test"'):
    print(match)
gives
"
"
If you need to change the second quote, you can also match the whole pair and put the part before the second quote into a capture group. Replacing the match with the first group plus the substitution string then achieves this.
For example, this regex will match everything before the second quote and put it into a group
(\"[^"]*)\"
If you replace the whole match (which includes the second quote) with only the value of the capture group (which does not include the second quote), you simply cut the second quote off.
See the online example
import re
p = re.compile(r'("[^"]*)"')
test_str = '"test1"test2"test3"'
subst = r"\1"
result = re.sub(p, subst, test_str)
print(result)  # result -> "test1test2"test3
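To replace the second quote with something else rather than cut it off, the "first group plus substitution string" idea might look like the sketch below. The function name replace_second_quotes is hypothetical; a function is used as the replacement so the new text is inserted literally, without backslash processing:

```python
import re

def replace_second_quotes(s, replacement):
    """Replace the closing quote of each pair with the given replacement text."""
    # group(1) is everything up to, but not including, the closing quote
    return re.sub(r'("[^"]*)"', lambda m: m.group(1) + replacement, s)

print(replace_second_quotes('"test1"test2"test3"', "'"))
```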
Please read my answer about why you don't want to use regular expressions for such a problem, even though you can do that kind of non-regular job with it.
Ok then you probably want one of the solutions I give in the linked answer, where you'll want to use a recursive regex to match all the matching pairs.
Edit: the following has been written before the update to the question, which was asking only for second double quotes.
Though if you want to find only second double quotes in a string, you do not need regexps:
>>> s1='aoeu"aoeu'
>>> s2='aoeu"aoeu"aoeu'
>>> s3='aoeu"aoeu"aoeu"aoeu'
>>> def find_second_quote(s):
...     pos_quote_1 = s.find('"')
...     if pos_quote_1 == -1:
...         return -1
...     pos_quote_2 = s[pos_quote_1+1:].find('"')
...     if pos_quote_2 == -1:
...         return -1
...     return pos_quote_1+1+pos_quote_2
...
>>> find_second_quote(s1)
-1
>>> find_second_quote(s2)
9
>>> find_second_quote(s3)
9
>>>
Here it returns -1 if there is no second quote, or the position of the second quote if there is one.
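A slightly shorter sketch of the same idea uses the optional start argument of str.find, which avoids slicing and the index arithmetic:

```python
def find_second_quote(s):
    """Return the index of the second double quote in s, or -1 if there is none."""
    first = s.find('"')
    if first == -1:
        return -1
    # Search again, starting just after the first quote; indices stay absolute
    return s.find('"', first + 1)

print(find_second_quote('aoeu"aoeu'))
print(find_second_quote('aoeu"aoeu"aoeu'))
```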
A parser is probably better, but depending on what you want to get out of it, there are other ways. If you need the data between the quotes:
import re
re.findall(r'".*?"', '"aaaa""aaaaa"aaaa""')
['"aaaa"',
'"aaaaa"',
'""']
If you need the indices, you could do it as a generator or other equivalent like this:
def count_quotes(mystr):
    count = 0
    for i, x in enumerate(mystr):
        if x == '"':
            count += 1
            if count % 2 == 0:
                yield i
list(count_quotes('"aaaa""aaaaa"aaaa""'))
[5, 12, 18]
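For comparison, a regex-based sketch that yields the same indices: each match covers one complete pair, and the capture group marks the closing quote. The function name second_quote_indices is made up for illustration:

```python
import re

def second_quote_indices(s):
    """Indices of the closing quote of every complete pair of double quotes."""
    # Opening quote, any non-quote characters, then the captured closing quote
    return [m.start(1) for m in re.finditer(r'"[^"]*(")', s)]

for s in ['"aaaaa"', '"aaaa""aaaaa"aaaa""', 'aaa"aaa', '"aaa"aaa"']:
    print(s, second_quote_indices(s))
```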

Python RegEx search and replace with part of original expression

I'm new to Python and looking for a way to replace all occurrences of "[A-Z]0" with the [A-Z] portion of the string to get rid of certain numbers that are padded with a zero. I used this snippet to get rid of the whole occurrence from the field I'm processing:
import re
def strip_zeros(s):
    return re.sub("[A-Z]0", "", s)
test = strip_zeros(!S_fromManhole!)
How do I perform the same type of procedure but without removing the leading letter of the "[A-Z]0" expression?
Thanks in advance!
Use backreferences.
http://www.regular-expressions.info/refadv.html "\1 through \9 Substituted with the text matched between the 1st through 9th pair of capturing parentheses."
http://docs.python.org/2/library/re.html#re.sub "Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern."
Untested, but it would look like this:
return re.sub(r"([A-Z])0", r"\1", s)
This places the first letter inside a capture group and references it with \1 in the replacement.
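A runnable sketch of the backreference approach. The second function, which removes a whole run of padding zeros via 0+, is an assumption about what might be wanted and is not part of the question; both function names are illustrative:

```python
import re

def strip_zero(s):
    """Remove a single 0 that directly follows a capital letter."""
    return re.sub(r"([A-Z])0", r"\1", s)

def strip_zero_run(s):
    """Remove every 0 in a run that directly follows a capital letter."""
    return re.sub(r"([A-Z])0+", r"\1", s)

print(strip_zero("MH05"))
print(strip_zero("A007"))
print(strip_zero_run("A007"))
```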
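You can try something like this (Python 2 syntax; note that it removes every 0 in the string, not just those following a capital letter):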
In [47]: s = "ab0"
In [48]: s.translate(None, '0')
Out[48]: 'ab'
In [49]: s = "ab0zy"
In [50]: s.translate(None, '0')
Out[50]: 'abzy'
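In Python 3, str.translate no longer accepts a deletechars argument, so the call above raises a TypeError there. A sketch of the equivalent using str.maketrans (or simply str.replace), with the same caveat that every 0 is removed:

```python
s = "ab0zy"

# The third argument of str.maketrans lists characters to delete
remove_zero = str.maketrans('', '', '0')
translated = s.translate(remove_zero)
replaced = s.replace('0', '')

print(translated)
print(replaced)
```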
I like Patashu's answer for this case but for the sake of completeness, passing a function to re.sub instead of a replacement string may be cleaner in more complicated cases. The function should take a single match object and return a string.
>>> def strip_zeros(s):
...     def unpadded(m):
...         return m.group(1)
...     return re.sub(r"([A-Z])0", unpadded, s)
...
>>> strip_zeros("Q0")
'Q'
