Extract string with specific format

I'm a novice in Python and I am trying to extract a string with a specific format from another string. For example, I have the original string:
---#$_ABC1234-XX12X
I need to extract exactly the string ABC1234 (it must consist of three letters followed by four digits).

You can use the curly brace repetition quantifiers {} to match exactly three alphabetic characters followed by exactly four numeric characters:
>>> from re import search
>>>
>>> string = '---#$_ABC1234-XX12X'
>>> match = search(r'[a-zA-Z]{3}\d{4}', string)
>>> match
<_sre.SRE_Match object; span=(6, 13), match='ABC1234'>
>>> match.group(0) # Use this to get the string that was matched.
'ABC1234'
Explanation of the regex:
[a-zA-Z]: Match any letter, upper case or lower case...
{3}: Exactly three times. And...
\d: Any digit character...
{4}: Exactly four times.
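The pattern above can be wrapped in a small helper that also copes with input containing no match, since search returns None in that case. This is only a sketch; the names CODE_PATTERN and extract_code are made up for illustration and are not part of the original answer.

```python
import re

# Three ASCII letters followed by exactly four digits.
CODE_PATTERN = re.compile(r'[a-zA-Z]{3}\d{4}')

def extract_code(text):
    """Return the first letters-plus-digits code in text, or None."""
    match = CODE_PATTERN.search(text)
    return match.group(0) if match else None

print(extract_code('---#$_ABC1234-XX12X'))  # ABC1234
print(extract_code('no code here'))         # None
```

Checking for None before calling group avoids an AttributeError on strings that do not contain the expected format.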

You can make use of the re module in Python:
import re
string = '---#$_ABC1234-XX12X'
matcher = re.search(r'(?P<matched_string>[a-zA-Z]{3}\d{4})', string)
needed_string = matcher.groupdict()['matched_string']
needed_string will be your desired output.
For the re module, refer to: https://docs.python.org/3.4/library/re.html

If you know the exact position of the string, you can use something like this:
>>> var = "--#$_ABC1234-XX12X"
>>> newstring = var[5:12]
>>> newstring
'ABC1234'
Python strings support slicing.

Related

Python Regex matching already matched sub-string

I'm fairly new to Python Regex and I'm not able to understand the following:
I'm trying to find one small letter surrounded by three capital letters.
My first problem is that the below regex is giving only one match instead of the two matches that are present ['AbAD', 'DaDD']
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[A-Z][a-z][A-Z][A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print(regex.findall(str))
['AbAD']
I guess the above is due to the fact that the last D of the first match is no longer available for matching? Is there any way to turn off this kind of matching?
The second issue is the following regex:
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[^A-Z][A-Z][a-z][A-Z][A-Z][^A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print(regex.findall(str))
[]
Basically, I want no more than three capital letters surrounding a small letter, so I placed a negated character class on each side. I expected ['AbAD'] to be matched, but it is not. Any ideas?

It's mainly because the matches overlap. Just put your regex inside a lookahead in order to handle this type of overlapping match.
(?=([A-Z][a-z][A-Z][A-Z]))
Code:
>>> s = 'AbADaDD'
>>> re.findall(r'(?=([A-Z][a-z][A-Z][A-Z]))', s)
['AbAD', 'DaDD']
For the 2nd one, you should use negative lookahead and lookbehind assertion like below,
(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))
Code:
>>> re.findall(r'(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))', s)
['AbAD']
The problem with your second regex is that [^A-Z] must consume a character, and there is no character before the first A in the string, so the pattern cannot match there. The negative lookbehind (?<![A-Z]) performs a similar check without consuming anything: it only asserts that the match is not preceded by an uppercase letter, which is also true at the very start of the string. That is why your version finds no match.
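To make the difference between consuming and zero-width matching visible, the sketch below lists the start offset of every match for both forms of the first pattern. It uses re.finditer rather than findall only so that the offsets can be shown; the variable names are illustrative.

```python
import re

s = 'AbADaDD'
pattern = r'[A-Z][a-z][A-Z][A-Z]'

# Plain matching consumes characters, so the scan resumes after 'AbAD'.
consuming = [(m.start(), m.group(0)) for m in re.finditer(pattern, s)]

# Wrapped in a lookahead, each match is zero-width; the text is in group 1.
overlapping = [(m.start(), m.group(1))
               for m in re.finditer('(?=(' + pattern + '))', s)]

print(consuming)    # [(0, 'AbAD')]
print(overlapping)  # [(0, 'AbAD'), (3, 'DaDD')]
```

The second match starts at offset 3, inside the text the first match already covered, which is only possible because the lookahead never consumes anything.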

The problem with your regex is that it consumes the string as it progresses, leaving nothing for the second match. Use a lookahead to make sure it does not consume the string.
pat = '(?=([A-Z][a-z][A-Z][A-Z]))'
For your second regex, do the same again.
print(re.findall(r"(?=([A-Z][a-z][A-Z][A-Z](?=[^A-Z])))", s))
To see why the original pattern finds only one match:
1) After the first match, the remaining string is aDD, because the first part has already been matched.
2) aDD does not satisfy pat = '[A-Z][a-z][A-Z][A-Z]', so it is not part of your match.

1st issue,
You should use this pattern,
r'([A-Z]{1}[a-z]{1}[A-Z]{1})'
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'([A-Z]{1}[a-z]{1}[A-Z]{1})', str)
['AbA', 'DaD']
2nd issue
You should use,
(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))', str)
['AbAD']

Python match a string with regex [duplicate]

I need a python regular expression to check if a word is present in a string. The string is separated by commas, potentially.
So for example,
line = 'This,is,a,sample,string'
I want to search based on "sample", this would return true. I am crappy with reg ex, so when I looked at the python docs, I saw something like
import re
re.match(r'sample', line)
But I don't know why there was an 'r' before the text to be matched. Can someone help me with the regular expression?

Are you sure you need a regex? It seems that you only need to know if a word is present in a string, so you can do:
>>> line = 'This,is,a,sample,string'
>>> "sample" in line
True
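One caveat with the in operator on a string is that it tests for a substring, not a whole word. Since the question says the string is comma-separated, a rough sketch of a whole-field check is to split first and test membership in the resulting list (the name fields is illustrative):

```python
line = 'This,is,a,sample,string'

# Substring test: also true for partial words such as 'sam'.
print('sam' in line)       # True

# Whole-field test: split on commas and compare complete items.
fields = line.split(',')
print('sam' in fields)     # False
print('sample' in fields)  # True
```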

The r makes the string a raw string, in which backslashes are not processed as escape characters (however, since there are none in this pattern, it is not actually needed here).
Also, re.match matches only at the beginning of the string. In other words, the pattern must match starting at the first character, although it does not have to cover the whole string. To match text that could be anywhere in the string, use re.search. See the demonstration below:
>>> import re
>>> line = 'This,is,a,sample,string'
>>> re.match("sample", line)
>>> re.search("sample", line)
<_sre.SRE_Match object at 0x021D32C0>
>>>
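As a compact comparison, the sketch below puts re.match and re.search next to re.fullmatch (available since Python 3.4), which is the function that requires the pattern to cover the entire string. The boolean comparisons are used only to keep the output short.

```python
import re

line = 'This,is,a,sample,string'

print(re.match('This', line) is not None)      # True: pattern is at the start
print(re.match('sample', line) is not None)    # False: not at the start
print(re.search('sample', line) is not None)   # True: found anywhere
print(re.fullmatch('This', line) is not None)  # False: must span the whole string
```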

r stands for a raw string, so backslashes are kept as literal characters instead of being interpreted as escape sequences by Python.
Normally, if you wanted your pattern to include a backslash, you'd need to escape it with another backslash. Raw strings eliminate this problem.
In your case it does not matter much, but it's a good habit to get into early; otherwise something like \b will bite you in the behind if you are not careful (it will be interpreted as a backspace character instead of a word boundary).
As for re.match vs re.search, here's an example that will clarify it for you:
>>> import re
>>> testString = 'hello world'
>>> re.match('hello', testString)
<_sre.SRE_Match object at 0x015920C8>
>>> re.search('hello', testString)
<_sre.SRE_Match object at 0x02405560>
>>> re.match('world', testString)
>>> re.search('world', testString)
<_sre.SRE_Match object at 0x015920C8>
So search will find a match anywhere, while match will only match at the beginning.
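The \b pitfall mentioned above is easy to reproduce. In this sketch the same pattern text is written once as a normal string and once as a raw string; the sample line is taken from the question.

```python
import re

line = 'This,is,a,sample,string'

# Without the r prefix, '\b' is the backspace character (code point 8).
print(len('\bsample\b'), len(r'\bsample\b'))    # 8 10

print(re.search('\bsample\b', line))            # None
print(re.search(r'\bsample\b', line).group(0))  # sample
```

The non-raw pattern looks for literal backspace characters around the word, so it finds nothing, while the raw pattern uses \b as a word boundary.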

You do not need regular expressions to check if a substring exists in a string.
line = 'This,is,a,sample,string'
result = 'sample' in line  # True
If you want to know whether a string contains a pattern, then you should use re.search:
import re
line = 'This,is,a,sample,string'
result = re.search(r'sample', line)  # match object for 'sample'
This is best used with pattern matching, for example:
line = 'my name is bob'
result = re.search(r'my name is (\S+)', line)  # result.group(1) is 'bob'

As everyone else has mentioned, it is better to use the "in" operator. You can also use it to check several words from a list:
line = "This,is,a,sample,string"
lst = ['This', 'sample']
for i in lst:
    print(i in line)
True
True

One Liner implementation:
a=[1,3]
b=[1,2,3,4]
all(i in b for i in a)

Correct usage of \D in python?

I have some code where I am trying to find a certain set of numbers. The length varies and I do not want them to be found amongst other numbers. For example the following code:
import re
reg = r"\D12345\D"
string = "12345"
matchedResults = re.finditer(reg, string)
for match in matchedResults:
    print(match.group(0))
This does not work if the number is just by itself. However, it will work if I put:
string="a12345"
but this will also match the a which is undesirable. Is there a better way to do this?

Use zero-width negative look-around assertions:
reg = r"(?<!\d)12345(?!\d)"
The look-around assertions (lookbehind and lookahead) match a position, not a character; the negative assertions only match if the preceding text or the following text respectively does not match the named pattern.
This means only locations that do not follow or precede a number will be matched; the start and end of a string will do for that purpose.
Demo:
>>> import re
>>> reg = re.compile(r"(?<!\d)12345(?!\d)")
>>> reg.search('12345')
<_sre.SRE_Match object at 0x102981ac0>
>>> reg.search('-12345-')
<_sre.SRE_Match object at 0x102a51238>
>>> reg.search('0123456')
>>> reg.search('012345-')
>>> reg.search('-123456')
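Since the question says the number varies, the look-around pattern can be built from a parameter. The following is a sketch under that assumption; find_exact_number is a hypothetical helper name, and re.escape is applied in case the value ever contains regex metacharacters.

```python
import re

def find_exact_number(number, text):
    """Return spans where number occurs and is not part of a longer digit run."""
    pattern = re.compile(r'(?<!\d)' + re.escape(number) + r'(?!\d)')
    return [m.span() for m in pattern.finditer(text)]

print(find_exact_number('12345', '12345'))            # [(0, 5)]
print(find_exact_number('12345', 'a12345b 0123456'))  # [(1, 6)]
```

The returned spans cover only the digits themselves, so neighbouring characters such as the a are never part of the match.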

extracting multiple instances regex python

I have a string:
This is #lame
Here I want to extract lame. But here is the issue: the above string can also be
This is lame
Here I don't extract anything. And then the string can be:
This is #lame but that is #not
Here I extract lame and not.
So, output I am expecting in each case is:
[lame]
[]
[lame,not]
How do I extract these in robust way in python?

Use re.findall() to find multiple patterns; in this case for anything that is preceded by #, consisting of word characters:
re.findall(r'(?<=#)\w+', inputtext)
The (?<=..) construct is a positive lookbehind assertion; it only matches if the current position is preceded by a # character. So the above pattern matches 1 or more word characters (the \w character class) only if those characters were preceded by an # symbol.
Demo:
>>> import re
>>> re.findall(r'(?<=#)\w+', 'This is #lame')
['lame']
>>> re.findall(r'(?<=#)\w+', 'This is lame')
[]
>>> re.findall(r'(?<=#)\w+', 'This is #lame but that is #not')
['lame', 'not']
If you plan on reusing the pattern, do compile the expression first, then use the .findall() method on the compiled regular expression object:
at_words = re.compile(r'(?<=#)\w+')
at_words.findall(inputtext)
This saves you a cache lookup every time you call .findall().

You should use the re library. Here is an example:
import re
test_case = "This is #lame but that is #not"
regular = re.compile(r"#[\w]*")
lst = regular.findall(test_case)  # ['#lame', '#not'], including the leading '#'

This will give the output you requested:
import re
regex = re.compile(r'(?<=#)\w+')
print(regex.findall('This is #lame'))
print(regex.findall('This is lame'))
print(regex.findall('This is #lame but that is #not'))

How to escape special regex characters in a string?

I use re.findall(p, text) to match a pattern generally, but now I came across a question:
I just want p to be matched as a normal string, not regex.
For example, p may contain '+' or '*', and I don't want these characters to have the special meanings they have in a regex. In other words, I want p to be matched character by character.
In this case p is unknown to me, so I can't add '\' to it myself to escape the special characters.

You can use re.escape:
>>> p = 'foo+*bar'
>>> import re
>>> re.escape(p)
'foo\\+\\*bar'
Or just use string operations to check if p is inside another string:
>>> p in 'blablafoo+*bar123'
True
>>> 'foo+*bar foo+*bar'.count(p)
2
By the way, re.escape is mainly useful if you want to embed p into a larger regex:
>>> re.match(r'\d.*{}.*\d'.format(re.escape(p)), '1 foo+*bar 2')
<_sre.SRE_Match object at 0x7f11e83a31d0>
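For the original use case, calling re.findall with a pattern that should be taken literally, a minimal sketch is to escape p before compiling it. The sample text is made up for illustration.

```python
import re

p = 'foo+*bar'  # treated as literal text
text = 'foo+*bar, fooobar, foo+*bar'

literal = re.compile(re.escape(p))
print(literal.findall(text))  # ['foo+*bar', 'foo+*bar']
```

Only the exact character sequence is found; fooobar is skipped because + and * no longer act as quantifiers.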

If you don't need a regex, and just want to test if the pattern is a substring of the string, use:
if pattern in string:
If you want to test at the start or end of the string:
if string.startswith(pattern): # or .endswith(pattern)
See the string methods section of the docs for other string methods.
If you need to know all locations of a substring in a string, use str.find:
offsets = []
offset = string.find(pattern, 0)
while offset != -1:
    offsets.append(offset)
    # start from after the location of the previous match
    offset = string.find(pattern, offset + 1)
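The loop above can be packaged as a reusable function. This is a sketch with hypothetical names (find_all, haystack, needle); because the search restarts one character after each hit, overlapping occurrences are reported too.

```python
def find_all(haystack, needle):
    """Return the start index of every occurrence of needle, overlaps included."""
    offsets = []
    offset = haystack.find(needle)
    while offset != -1:
        offsets.append(offset)
        # Resume one character after the previous hit.
        offset = haystack.find(needle, offset + 1)
    return offsets

print(find_all('test string 1+2*3', 't'))  # [0, 3, 6]
print(find_all('aaaa', 'aa'))              # [0, 1, 2]
```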

You can use .find on strings. This returns the index of the first occurrence of the "needle" string (or -1 if it's not found). For example:
>>> a = 'test string 1+2*3'
>>> a.find('str')
5
>>> a.find('not there')
-1
>>> a.find('1+2*')
12
