What should be the regex for strings? - python

I'm trying to write a Regex for strings -
c190_12_1-10
c129
abc_1-90
to separate to -
['c190_12_', '1', '10']
['c', '129']
['abc_', '1', '90']
So far I've come up with (\D+)(\d+)-?(\d+)?
But it doesn't work for all combinations. What am I missing here?

You can use this:
import re

items = ['c190_12_1-10', 'c129', 'abc_1-90']
reg = re.compile(r'^(.+?)(\d+)(?:-(\d+))?$')
for item in items:
    m = reg.match(item)
    print(m.groups())

Not sure what exactly you do and don't want to match, but this might work for you:
(?:(\w+)(\d+)-|([a-z]+))(\d+)$
http://regex101.com/r/uA3eZ4
The secret here was working backwards from the end, where the condition always seems to be the same. Then, using alternation and the non-capturing group, you end up with the result you've shown.
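As a quick check of the alternation pattern above, here is a minimal sketch that runs it over the three sample strings and drops the groups that did not take part in the match. The helper name split_item is hypothetical, and filtering out None is an assumption about how you might post-process the groups, not part of the original answer.

```python
import re

# Pattern from the answer above: either "<prefix><digits>-" or a purely
# alphabetic prefix, followed by the final run of digits.
PATTERN = re.compile(r'(?:(\w+)(\d+)-|([a-z]+))(\d+)$')

def split_item(item):
    """Return the groups that took part in the match, or None if no match."""
    m = PATTERN.match(item)
    if m is None:
        return None
    # Groups from the branch that was not taken come back as None.
    return [g for g in m.groups() if g is not None]

for item in ['c190_12_1-10', 'c129', 'abc_1-90']:
    print(split_item(item))
```

Because the groups of the unused branch are None, the position of the prefix in m.groups() differs between the two branches, which is why the sketch filters rather than indexes.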

Related

Regular expression for phone number check does not work

I'm trying to use a regular expression for checking phone numbers.
Below is the code I'm using:
phnum ='1-234-567-8901'
pattern = re.search('^\+?\d{0,3}\s?\(?\d{3}\)?[-.\s]?d{3}[-.\s]?d{4}$',phnum,re.IGNORECASE)
print(pattern)
Even for simple numbers it does not seem to work. Can anyone please tell me where I am going wrong?
Here's a potential solution. I'm not great at regex, so I may be missing something.
import re
phone_pattern = re.compile(r"^(\+?\d{0,2}-)?(\d{3})-(\d{3})-(\d{4})$")
phone_numbers = ["123-345-6134",
                 "1-234-567-8910",
                 "+01-235-235-2356",
                 "123-123-123-123",
                 "1-asd-512-1232",
                 "a-125-125-1255",
                 "234-6721"]
for num in phone_numbers:
    print(phone_pattern.findall(num))
Output:
[('', '123', '345', '6134')]
[('1-', '234', '567', '8910')]
[('+01-', '235', '235', '2356')]
[]
[]
[]
[]
The immediate problem is that you are missing the \ before the last two d's. Furthermore, the first \s obviously does not match a dash.
I would also strongly encourage r'...' raw strings for all regexes, to keep Python's string parser from evaluating some backslash sequences before they reach the regex engine.
import re

phnum = '1-234-567-8901'
pattern = re.search(
    r'^\+?\d{0,3}[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$',
    phnum, re.IGNORECASE)
print(pattern)
Demo: https://ideone.com/EYtTKZ
More fundamentally, perhaps you should only accept a closing parenthesis if there is an opening parenthesis before it, etc. A common approach is to normalize number sequences before attempting to use them as phone numbers, but it's a bit of a chicken and egg problem. (You don't want to get false positives on large numbers or IP addresses, for example.)
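One way to accept a closing parenthesis only when an opening one precedes it is to spell out both forms of the area code as alternatives. This is a minimal sketch under that assumption; the names PHONE and is_phone are hypothetical, and the rest of the pattern is taken from the corrected regex above.

```python
import re

# The area code is either fully parenthesized or bare; a lone "(" or ")"
# no longer matches.
PHONE = re.compile(
    r'^\+?\d{0,3}[-.\s]?(?:\(\d{3}\)|\d{3})[-.\s]?\d{3}[-.\s]?\d{4}$')

def is_phone(text):
    """Return True if text looks like a phone number under this pattern."""
    return PHONE.match(text) is not None

for candidate in ['1-234-567-8901', '(234) 567-8901',
                  '234) 567-8901', '(234 567-8901']:
    print(candidate, is_phone(candidate))
```

The first two candidates are accepted and the last two, with unbalanced parentheses, are rejected. The pattern still accepts any bare run of ten digits, so the false-positive concern above remains.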

How to put a condition in regex in python?

I have a regex like this:
query = "(A((hh)|(hn)|(n))?)"
and an input inp = "Ahhwps edAn". I want to extract all the matched patterns along with the unmatched (remaining) text, preserving the order of the input.
The output should look like ['Ahh', 'wps ed', 'An'] or ['Ahh', 'w', 'p', 's', ' ', 'e', 'd', 'An'].
I have searched online but found nothing.
How can I do this?
The re.split method also returns captured submatches in the resulting list.
Capturing groups are constructs formed with a pair of unescaped parentheses. Your pattern abounds in redundant capturing groups, and re.split will return all of them. Remove the unnecessary ones by converting the inner capturing groups to non-capturing ones, and keep only the outer pair of parentheses so that the whole pattern is a single capturing group.
Use
re.split(r'(A(?:hh|hn|n)?)', inp)
Note that there may be empty elements in the output list. Use filter(None, result) to get rid of the empty values (in Python 3, wrap it in list() to get a list back).
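Put together, the re.split approach might look like the following sketch. The function name split_keep_matches is hypothetical, and a list comprehension stands in for filter so the result is a list on both Python 2 and 3.

```python
import re

def split_keep_matches(text):
    # The single outer capturing group makes re.split return the matched
    # separators alongside the text between them.
    parts = re.split(r'(A(?:hh|hn|n)?)', text)
    # Drop the empty strings produced when a match sits at either end.
    return [p for p in parts if p]

print(split_keep_matches("Ahhwps edAn"))  # ['Ahh', 'wps ed', 'An']
```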
The match objects' span() method is really useful for what you're after.
import re
pat = re.compile("(A((hh)|(hn)|(n))?)")
inp = "Ahhwps edAn"
result = []
i = k = 0
for m in re.finditer(pat, inp):
    j, k = m.span()
    if i < j:
        # unmatched text between the previous match and this one
        result.append(inp[i:j])
    result.append(inp[j:k])
    i = k
if i < len(inp):
    # unmatched text after the last match
    result.append(inp[k:])
print(result)
Here's what the output looks like.
['Ahh', 'wps ed', 'An']
This technique handles any non-matching prefix and suffix text as well. If you use an inp value of "prefixAhhwps edAnsuffix", you'll get the output I think you'd want:
['prefix', 'Ahh', 'wps ed', 'An', 'suffix']
You can try this:
import re
import itertools
new_data = list(itertools.chain.from_iterable([re.findall(".{"+str(len(i)//2)+"}", i) for i in inp.split()]))
Output:
['Ahh', 'wps', 'ed', 'An']

Python Regex Simple Split - Empty at first index

I have a String that looks like
test = '20170125NBCNightlyNews'
I am trying to split it into two parts: the digits and the name. The format will always be [date][show]. The date has no separators and consists only of digits in YYYYMMDD order (I don't think that matters).
I am trying to use re. I have a working version:
re.split('(\d+)',test)
Simple enough, this gives me the values I need in a list.
['', '20170125', 'NBCNightlyNews']
However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.
I also tried telling it to match the beginning of the string as well, and got the same results.
>>> re.split('(^\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>> re.split('^(\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>>
Does anyone have any input as to why this is there / how I can avoid the empty string?
Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse a re.split to parse this data:
test[:8], test[8:]
Will split your strings just fine.
What re.split('(^\d+)', test) actually does is split your test string on a number of at least one digit (anchored to the start of the string by the ^).
So, if you have
test = '20170125NBCNightlyNews'
This is happening:
20170125 NBCNightlyNews
^^^^^^^^
The string is split into three parts, everything before the number, the number itself and everything after the number.
Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.
re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']
re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']
You're getting an empty string at the beginning because your input string starts with digits and you're splitting on digits. The empty string is whatever comes before the first run of digits.
To avoid that you can use filter:
>>> print(list(filter(None, re.split(r'(\d+)', test))))
['20170125', 'NBCNightlyNews']
Why re.split when you can just match and get the groups?...
import re
test = '20170125NBCNightlyNews'
pattern = re.compile(r'(\d+)(\w+)')
result = re.match(pattern, test)
result.groups()[0] # for the date part
result.groups()[1] # for the show name
I realize now the intention was to parse the text, not fix the regex usage. I'm with the others, you shouldn't use regex for this simple task when you already know the format won't change and the date is fixed size and will always be first. Just use string indexing.
From the documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.
So if you have:
test = 'test20170125NBCNightlyNews'
The indexes would remain unaffected:
>>> re.split(r'(\d+)', test)
['test', '20170125', 'NBCNightlyNews']
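The end-of-string case from the documentation quote can be checked the same way. The sample strings below are made up for illustration.

```python
import re

# Separator at the end of the string: the result ends with an empty string.
trailing = re.split(r'(\d+)', 'NBCNightlyNews20170125')
print(trailing)  # ['NBCNightlyNews', '20170125', '']

# Separator at both ends: empty strings on both sides.
both_ends = re.split(r'(\d+)', '2017NBC0125')
print(both_ends)  # ['', '2017', 'NBC', '0125', '']
```

In both results the captured separators sit at the odd indices, which is the "same relative indices" guarantee the documentation describes.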
If the date is always 8 digits long, I would access the substrings directly (without using regex):
>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']
If the length of the date might vary, I would use:
>>> s = re.search(r'^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']

Splitting string in Python starting from the first numeric character

I need to process strings in Python in this way:
I have strings with '>=', '=', '<=', '<' or '>' in front of them, for example:
'>=1_2_3'
'<2_3_2'
What I want to achieve is splitting the strings to obtain, respectively:
'>=', '1_2_3'
'<', '2_3_2'
Basically, I need to split them at the first numeric character.
Is there a way to achieve this with regular expressions, without iterating over the string and checking whether each character is a number or a '_'?
Thank you.
This will do:
re.split(r'(^[^\d]+)', string)[1:]
Example:
>>> re.split(r'(^[^\d]+)', '>=1_2_3')[1:]
['>=', '1_2_3']
>>> re.split(r'(^[^\d]+)', '<2_3_2')[1:]
['<', '2_3_2']
import re
strings = ['>=1_2_3', '<2_3_2']
for s in strings:
    mat = re.match(r'([^\d]*)(\d.*)', s)
    print(mat.groups())
Outputs:
('>=', '1_2_3')
('<', '2_3_2')
This just groups everything up until the first digit in one group, then that first digit and everything after into a second.
You can access the individual groups with mat.group(1), mat.group(2)
You can split using this regex:
(?<=[<>=])(?=\d)
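A minimal sketch of splitting with that lookaround pattern in Python follows. Note the assumption: re.split only splits on zero-width matches from Python 3.7 onward, so this will not work on older versions. The function name split_operator is hypothetical.

```python
import re

def split_operator(text):
    # Zero-width split point: just after <, > or = and just before a digit.
    # Requires Python 3.7+, where re.split accepts empty matches.
    return re.split(r'(?<=[<>=])(?=\d)', text)

for s in ['>=1_2_3', '<2_3_2']:
    print(split_operator(s))
```

Since the pattern consumes no characters, nothing has to be captured or rejoined afterwards.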
There's probably a better way but you can split with a capture then join the second two elements:
values = re.split(r'(\d)', '>=1_2_3', maxsplit = 1)
values = [values[0], values[1] + values[2]]

What would be a good regular expression to match out indexes from a string variable?

I have a string declared as:
String testString = "1: 2, 3\n2: 3, 4\n3: 1\n4: 2\n5: 6, 7, 8"
I want to write a regular expression that returns the indexes, i.e., the following list (or any collection):
[1,2,3,4,5]
I know if I write my regular expression as:
regex = r'[0-9]:'
I will get:
[1:,2:,3:,4:,5:]
However, this is not the desired output. I've just started with regular expressions, and I've applied all I know (so far) to this problem, however, I'm unable to write a valid regular expression. Help would be appreciated, thanks.
You can use r'[0-9]+(?=:)', which uses a lookahead to require the colon without including it in the match.
regx = re.compile(r'(?<!\d)\d+(?=\s*:)')
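For illustration, here is how the lookahead pattern might be applied to the string from the question. The variable names test_string and indexes are placeholders, and converting the matches to int is an assumption about the desired result type.

```python
import re

test_string = "1: 2, 3\n2: 3, 4\n3: 1\n4: 2\n5: 6, 7, 8"

# The lookahead (?=:) requires a colon after the digits but leaves it
# out of the match, so findall returns just the numbers.
indexes = [int(n) for n in re.findall(r'[0-9]+(?=:)', test_string)]
print(indexes)  # [1, 2, 3, 4, 5]
```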
You could use your regex, then use a list comprehension to remove the trailing colons:
>>> [a.strip(':') for a in re.findall(r"[0-9]:", testString)]
['1', '2', '3', '4', '5']
