I have the following text:
s1 = 'Promo Tier 77 (4.89 USD)'
s2 = 'Promo (11.50 USD) Tier 1 Titles Only'
From this I want to pull out the number that is not included in the parenthetical. It would be:
s1 --> '77'
s2 --> '1'
I am currently using the weak regex re.findall('\s\d+\s',s1). What would be the correct regex? Something like re.findall('\d+',s1) but excluding anything within the parenthetical.
>>> re.findall('\d+',s1)
['77', '4', '89'] # two of these numbers are within the parenthetical.
# I only want '77'
One approach I find useful is the alternation operator: put what you want to exclude on the left side (in effect saying "throw this away, it's garbage") and put what you want to match in a capturing group on the right side.
You can then use filter or a list comprehension to drop the empty strings that findall returns whenever the left side of the alternation matches.
>>> import re
>>> s = """Promo (11.50 USD) Tier 1 Titles Only
Promo (11.50 USD) (10.50 USD, 11.50 USD) Tier 5
Promo Tier 77 (4.89 USD)"""
>>> list(filter(None, re.findall(r'\([^)]*\)|(\d+)', s)))
['1', '5', '77']
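The list-comprehension variant mentioned above could look like the following in Python 3. This is only a sketch; the function name numbers_outside_parens is made up for illustration.

```python
import re

def numbers_outside_parens(text):
    # The left alternative consumes "(...)" and leaves the group empty;
    # the right alternative captures digits found outside parentheses.
    matches = re.findall(r'\([^)]*\)|(\d+)', text)
    # Drop the empty strings produced by the parenthetical matches.
    return [m for m in matches if m]

print(numbers_outside_parens('Promo Tier 77 (4.89 USD)'))              # ['77']
print(numbers_outside_parens('Promo (11.50 USD) Tier 1 Titles Only'))  # ['1']
```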
You could make a temporary string that has the parenthesis section removed, then run your code. I used a space so that numbers before and after the missing string section can't be joined.
>>> import re
>>> s = 'Promo Tier 77 (11.50 USD) Tier 1 Titles Only'
>>> temp = re.sub(r'\(.*?\)', ' ', s)
>>> print(temp)
Promo Tier 77   Tier 1 Titles Only
>>> re.findall(r'\d+', temp)
['77', '1']
And you could of course shorten this to a single line.
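For instance, the single-line form might be written like this (a sketch that nests the same two calls):

```python
import re

s = 'Promo Tier 77 (11.50 USD) Tier 1 Titles Only'
# Blank out each parenthetical, then collect the remaining digit runs.
numbers = re.findall(r'\d+', re.sub(r'\(.*?\)', ' ', s))
print(numbers)  # ['77', '1']
```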
Do some splitting on your strings, e.g.:
import re
s1 = "Promo Tier 77 (4.89 USD)"
for chunk in s1.split(")"):
    # Keep only the text before the opening parenthesis, if there is one.
    outside = chunk.split("(")[0]
    for number in re.findall(r"\d+", outside):
        print(number)
(\b\d+\b)(?=(?:[^()]*\([^)]*\))*[^()]*$)
Try this. Grab the capture. See the demo:
http://regex101.com/r/gT6kI4/7
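To try the same pattern from Python, something like the sketch below should work. It assumes the parentheses in the input are balanced and not nested, since the lookahead only accepts a number when everything after it is made up of complete "(...)" groups and text without parentheses.

```python
import re

# A number counts only if the rest of the string consists of complete
# parenthetical groups and parenthesis-free text.
outside = re.compile(r'(\b\d+\b)(?=(?:[^()]*\([^)]*\))*[^()]*$)')

for s in ('Promo Tier 77 (4.89 USD)', 'Promo (11.50 USD) Tier 1 Titles Only'):
    print(outside.findall(s))
# ['77']
# ['1']
```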
I have the following string for which I want to extract data:
text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
Every variable I want to extract starts with \n
The value I want to get starts with a colon ':' followed by more than 1 dot
When a line doesn't have a colon followed by dots, I don't want to extract its value.
For example my preferred output looks like:
LOA = 189.9
LBP = 176.0
BM = 26.4
DM = 9.2
import re
text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
# capture all the characters BEFORE the ':' character
variables = re.findall(r'(.*?):', text_example)
# matches all floats and integers (does not account for minus signs)
values = re.findall(r'(\d+(?:\.\d+)?)', text_example)
# zip into a dictionary (this assumes both regexes return the same number of results)
result = dict(zip(variables, values))
print(result)
--> {'Length oa': '189.9', 'Breadth moulded': '26.4', 'Length bp': '176', 'Depth moulded to main deck': '9.2'}
You can also build a single regex with capture groups and index into the list of matches:
re.findall(r'(\\n|\n)([A-Za-z\s]*)(?:(\:\s*\.+))(\d*\.*\d*)',text_example)[2]
('\n', 'Breadth moulded', ': .......', '26.4')
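The requirement that a value must follow a colon and more than one dot can also be enforced in the pattern itself. The following is a sketch rather than a definitive solution: it keys the result by the label text found in the string (not by the LOA/LBP abbreviations from the question) and converts the values to float.

```python
import re

text_example = ('\nExample text \nTECHNICAL PARTICULARS\n'
                'Length oa: ...............189.9m\nLength bp: ........176m\n'
                'Breadth moulded: .......26.4m\n'
                'Depth moulded to main deck: ....9.2m\n')

# Label at the start of a line, a colon, at least two dots, then the number.
pattern = re.compile(r'^([^:\n]+):\s*\.{2,}\s*(\d+(?:\.\d+)?)', re.MULTILINE)

result = {name: float(value) for name, value in pattern.findall(text_example)}
print(result)
```

Lines without a colon followed by dots, such as "TECHNICAL PARTICULARS", are skipped because the pattern cannot match them.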
I have a list of names with different notations:
for example:
myList = ['ab2000', 'abc2000_2000', 'AB2000', 'ab2000_1', 'ABC2000_01', 'AB2000_2', 'ABC2000_02', 'AB2000_A1']
The standardized versions of those different notations are, for example:
'ab2000' is 'ABC2000'
'ab2000_1' is 'ABC2000_01'
'AB2000_A1' is 'ABC2000_A1'
What I tried is to separate the different characters of the string using compile.
input:
compiled = re.compile(r'[A-Za-z]+|\d+|\W+')
compiled.findall("AB2000_2000_A1")
output:
characters = ['AB', '2000', '2000', 'A', '1']
Then applying:
characters = list(set(characters))
To finally try to match the values of that list with the main components of the string: an alpha format followed by a digit format followed by an alphanumeric format.
But as you can see in the previous output I can't match 'A1' into a single character using \W+. My desired output is:
characters = ['AB', '2000', '2000', 'A1']
Any idea how to fix that?
Or is there a better way to solve my problem in general? Thank you in advance.
Use the following pattern with optional groups and capturing groups:
r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?'
and re.I flag.
Note that (?:_([A-Z\d]+))? must be repeated in order to match both
third and fourth group. If you attempted to "repeat" this group, putting
it once with "*" it would match only the last group, skipping the third
group.
To test it, I ran the following test:
import re

myList = ['ab2000', 'abc2000_2000', 'AB2000', 'ab2000_1', 'ABC2000_01',
          'AB2000_2', 'ABC2000_02', 'AB2000_A1', 'AB2000_2000_A1']
pat = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)
for tt in myList:
    print(f'{tt:16} ', end=' ')
    mtch = pat.match(tt)
    if mtch:
        # Print only the groups that actually took part in the match.
        for it in mtch.groups():
            if it is not None:
                print(f'{it:5}', end=' ')
    print()
getting:
ab2000 ab 2000
abc2000_2000 abc 2000 2000
AB2000 AB 2000
ab2000_1 ab 2000 1
ABC2000_01 ABC 2000 01
AB2000_2 AB 2000 2
ABC2000_02 ABC 2000 02
AB2000_A1 AB 2000 A1
AB2000_2000_A1 AB 2000 2000 A1
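Building on those groups, a standardization step might look like the sketch below. The alias table (AB to ABC) and the two-digit zero-padding of numeric suffixes are assumptions inferred from the three examples in the question, and the names PREFIX_ALIASES and standardize are hypothetical.

```python
import re

NAME_RE = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)

# Hypothetical alias table: only the AB -> ABC case appears in the question.
PREFIX_ALIASES = {'AB': 'ABC'}

def standardize(name):
    m = NAME_RE.fullmatch(name)
    if m is None:
        return name  # leave unrecognized names untouched
    prefix, number, *suffixes = m.groups()
    prefix = prefix.upper()
    prefix = PREFIX_ALIASES.get(prefix, prefix)
    parts = [prefix + number]
    for suffix in suffixes:
        if suffix is None:
            continue
        # Assumption: purely numeric suffixes are padded to two digits.
        parts.append(suffix.zfill(2) if suffix.isdigit() else suffix.upper())
    return '_'.join(parts)

print(standardize('ab2000'))     # ABC2000
print(standardize('ab2000_1'))   # ABC2000_01
print(standardize('AB2000_A1'))  # ABC2000_A1
```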
I have a regular expression that matches the phone numbers:
import re
phones = re.findall(r'[+(]?[0-9][0-9 \-()]{8,}[0-9]', text)
It shows good accuracy in a large raw text dataset.
But sometimes it matches unwanted results (ranges of years and random IDs).
Ranges of years:
'2012 - 2017'
'(2011 - 2013'
'1999 02224'
'2019 2010-2015'
'2018-2018 (5'
'2004 -2009'
'1) 2005-2006'
'2011 2020'
Random ids:
'5 5 5 5'
'100032479008252'
'100006711277302'
I have ideas on how to solve these problems.
Limit the total number of digits to 12 digits.
Limit the total number of characters to 16 characters.
Remove the ranges of years (19**|20** - 19**|20**).
But I do not know how to implement these ideas and make them as exceptions in my regular expression.
Some examples that a regular expression should catch are presented below:
380-956-425979
+38(097)877-43-88
+38(050) 284-24-20
(097) 261-60-52
380-956-425979
(068)1850063
0975533222
I suggest you write different patterns for different phone structures. I'm not sure what all of your phone number structures look like, but this matches your examples:
import re
test = '''380-956-425979
+38(097)877-43-88
+38(050) 284-24-20
(097) 261-60-52
380-956-425979
(068)1850063
0975533222'''
solution = test.split("\n")
p1 = r"\+?\d{3}\-\d{3}\-\d{6}"
p2 = r"\+?(?:\d{2})?\(\d{3}\) ?\d{3}\-\d{2}\-\d{2}"
p3 = r"\+?(?:\(\d{3}\)|\d{3})\d{7}"
result = re.findall(f'{p1}|{p2}|{p3}', test)
print(solution)
print(result)
You could do it in Python directly:
if re.match("condition", "teststring") and not re.match("not-condition", "teststring"):
    print("Match!")
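Applied to the question, the "match, then reject" idea could be sketched as a post-filter on the original pattern. The names YEAR_RANGE_RE and find_phones are made up for illustration. The 16-character limit from the question is deliberately left out, because the example '+38(050) 284-24-20' is 18 characters long and would be rejected by it.

```python
import re

PHONE_RE = re.compile(r'[+(]?[0-9][0-9 \-()]{8,}[0-9]')
# Two years starting with 19 or 20, separated by a hyphen.
YEAR_RANGE_RE = re.compile(r'\b(?:19|20)\d{2}\s*-\s*(?:19|20)\d{2}\b')

def find_phones(text, max_digits=12):
    phones = []
    for candidate in PHONE_RE.findall(text):
        # Reject long IDs: count digits only, ignoring separators.
        if sum(ch.isdigit() for ch in candidate) > max_digits:
            continue
        # Reject candidates that contain a year range such as "2012 - 2017".
        if YEAR_RANGE_RE.search(candidate):
            continue
        phones.append(candidate)
    return phones

sample = ('call 0975533222 or +38(097)877-43-88\n'
          'worked 2012 - 2017\n'
          'id 100032479008252')
print(find_phones(sample))  # ['0975533222', '+38(097)877-43-88']
```

Some false positives from the question, such as '1999 02224', contain no hyphenated year range and stay under the digit limit, so they would still pass this filter.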
I like regular expressions. I often find myself using multiple regex statements to narrow in on the value I need when trying to get a substring from a large block of text.
So far, my approach has been the following:
Use resultOfRegex1 = re.findall(firstRegex, myString) for my first regex
Check to see that resultOfRegex1[0] exists
Use resultOfRegex2 = re.findall(secondRegex, resultOfRegex1[0]) for my second regex
Check to see that resultOfRegex2[0] exists, and print that value
But I feel like this is much more verbose and costly than it has to be. Is there an easier/faster way to match one regex and then match another regex based on the result of the first?
The whole point of groups is to allow extraction of subgroups from an overall match.
For example, instead of doing two searches in the following fashion:
>>> import re
>>> s = 'The winning team scored 15 points and used only 2 timeouts'
>>> score_clause = re.search(r'scored \d+ point', s).group(0)
>>> re.search(r'\d+', score_clause).group(0)
'15'
Do a single search with a sub-group:
>>> re.search(r'scored (\d+) point', s).group(1)
'15'
One other thought: if you want to make decisions about whether to continue a findall-style search based on the first match, a reasonable choice would be to use re.finditer and extract values as needed:
>>> game_results = '''\
10 point victory: 1 in first period, 6 in second period, 3 in third period.
5 point victory: 0 in first period, 5 in second period, 0 in third period.
12 point victory: 5 in first period, 3 in second period, 4 in third period.
7 point victory: 3 in first period, 0 in second period, 4 in third period.
'''.splitlines()
>>> # Show period-by-period scores for games won by 8 or more points
>>> for game_result in game_results:
...     it = re.finditer(r'\d+', game_result)
...     if int(next(it).group(0)) >= 8:
...         print('Big win:', [int(mo.group(0)) for mo in it])
...
Big win: [1, 6, 3]
Big win: [5, 3, 4]
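The "check that the result exists" steps from the question can be folded into a small helper around re.search, which returns None when nothing matches. This is a minimal sketch; the helper name first_group is hypothetical.

```python
import re

def first_group(pattern, text, default=None):
    """Return group 1 of the first match, or `default` when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else default

s = 'The winning team scored 15 points and used only 2 timeouts'
print(first_group(r'scored (\d+) point', s))  # 15
print(first_group(r'fouled (\d+) time', s))   # None
```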
I am trying to get all the digits from following string after the word classes (or its variations)
Accepted for all the goods and services in classes 16 and 41.
expected output:
16
41
I have multiple strings which follows this pattern and some others such as:
classes 5 et 30 # expected output 5, 30
class(es) 32,33 # expected output 32, 33
class 16 # expected output 16
Here is what I have tried so far: https://regex101.com/r/eU7dF6/3
(class[\(es\)]*)([and|et|,|\s]*(\d{1,}))+
But I am able to get only the last matched digit i.e. 41 in the above example.
I suggest grabbing all the substrings with numbers after class, classes or class(es), and then getting all the numbers from those:
import re
p = re.compile(r'\bclass(?:\(?es\)?)?(?:\s*(?:and|et|[,\s])?\s*\d+)+')
test_str = "Accepted for all the goods and services in classes 16 and 41."
results = [re.findall(r"\d+", x) for x in p.findall(test_str)]
print([x for l in results for x in l])
# => ['16', '41']
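The same two-step idea can be wrapped in a function and run against the other sample strings from the question. This is a sketch; the names CLASS_RUN and class_numbers are made up, and the results are converted to int here.

```python
import re

# "class", "classes" or "class(es)" followed by one or more numbers
# separated by "and", "et", commas or whitespace.
CLASS_RUN = re.compile(r'\bclass(?:\(?es\)?)?(?:\s*(?:and|et|[,\s])?\s*\d+)+')

def class_numbers(text):
    return [int(n)
            for run in CLASS_RUN.findall(text)
            for n in re.findall(r'\d+', run)]

print(class_numbers('Accepted for all the goods and services in classes 16 and 41.'))  # [16, 41]
print(class_numbers('classes 5 et 30'))  # [5, 30]
print(class_numbers('class(es) 32,33'))  # [32, 33]
print(class_numbers('class 16'))         # [16]
```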
Python's re module does not support the \G construct and gives no access to the capture stack, so your approach cannot work with it.
However, you can do it your way with the PyPI regex module.
>>> import regex
>>> test_str = "Accepted for all the goods and services in classes 16 and 41."
>>> rx = r'\bclass(?:\(?es\)?)?(?:\s*(?:and|et|[,\s])?\s*(?P<num>\d+))+'
>>> res = []
>>> for x in regex.finditer(rx, test_str):
...     res.extend(x.captures("num"))
...
>>> print(res)
['16', '41']
You can do it in two steps. The regex engine remembers only the last capture of a repeated group.
x = """Accepted for all the goods and services in classes 16 and 41."""
print(re.findall(r"\d+", re.findall(r"class[\(es\)]*\s*(\d+(?:(?:and|et|,|\s)*\d+)*)", x)[0]))
Output: ['16', '41']
If you don't want strings, use:
import ast
print(list(map(ast.literal_eval, re.findall(r"\d+", re.findall(r"class[\(es\)]*\s*(\d+(?:(?:and|et|,|\s)*\d+)*)", x)[0]))))
Output: [16, 41]
If you have to do it in one regex, use the regex module:
import ast
import regex
x = """Accepted for all the goods and services in classes 16 and 41."""
print([ast.literal_eval(i) for i in regex.findall(r"class[\(es\)]*|\G(?:and|et|,|\s)*(\d+)", x, regex.VERSION1) if i])
Output: [16, 41]