Match strings using a regular expression, except specific string combinations (Python)

In a list, I need to match specific items, except those containing a particular combination of characters.
Let's say I have a list of strings like the following:
l = [
'PSSTFRPPLYO',
'BNTETNTT',
'DE52 5055 0020 0005 9287 29',
'210-0601001-41',
'BSABESBBXXX',
'COMMERZBANK'
]
I need to match all the words that look like a SWIFT/BIC code. Such a code has the following form:
6 letters, followed by
2 letters/digits, followed by
3 optional letters/digits
Hence I have written the following regex to match this pattern:
import re
regex = re.compile(r'(?<!\w)[a-zA-Z]{6}[a-zA-Z0-9]{2}([a-zA-Z0-9]{3})?(?!\w)')
for item in l:
    match = regex.search(item)
    if match:
        print('found a match, the matched string {} the match {}'.format(item, item[match.start():match.end()]))
    else:
        print('found no match in {}'.format(item))
I need the following items to be matched:
result = ['PSSTFRPPLYO', 'BNTETNTT', 'BSABESBBXXX' ]
but instead I get:
result = ['PSSTFRPPLYO', 'BNTETNTT', 'BSABESBBXXX', 'COMMERZBANK' ]
So what I need is to match only the strings that don't contain the word 'bank'.
To do so, I refined my regex to:
regex = re.compile((?<!bank/i)(?<!\w)[a-zA-Z]{6}[a-zA-Z0-9]{2}([a-zA-Z0-9]{3})?(?!\w)(?!bank/i))
I simply added a negative lookbehind and a negative lookahead.
My regex doesn't do the filtering I intended. What did I miss?

You can try this:
import re
final_vals = [i for i in l if re.findall(r'^[a-zA-Z]{6}\w{2}|(^[a-zA-Z]{6}\w{2}\w{3})', i) and not re.findall('BANK', i, re.IGNORECASE)]
Output:
['PSSTFRPPLYO', 'BNTETNTT', 'BSABESBBXXX']
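As a possible alternative, the check can be folded into one pattern. Two things seem to go wrong in the attempt from the question: Python's re module has no /i suffix (the characters /i are matched literally, and case-insensitivity is requested with re.IGNORECASE), and the lookarounds inspect the text before and after the candidate, whereas BANK sits inside it. The sketch below is only one way to do it; the name SWIFT_RE is made up for this example, and it assumes that "contains bank" means the word starting at the candidate position contains it.

```python
import re

l = [
    'PSSTFRPPLYO',
    'BNTETNTT',
    'DE52 5055 0020 0005 9287 29',
    '210-0601001-41',
    'BSABESBBXXX',
    'COMMERZBANK',
]

# (?!\w*bank) is checked at the start of the candidate word, so it rejects
# any word that contains "bank" anywhere; re.IGNORECASE covers "BANK" too.
SWIFT_RE = re.compile(
    r'(?<!\w)(?!\w*bank)[a-zA-Z]{6}[a-zA-Z0-9]{2}(?:[a-zA-Z0-9]{3})?(?!\w)',
    re.IGNORECASE,
)

result = [item for item in l if SWIFT_RE.search(item)]
print(result)
```

Keeping the two checks separate, as in the answer above, is arguably easier to read; the single pattern is mainly useful when the codes are embedded in longer text.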

Related

Keep records with a specific prefix and filter all-numeric

I have a PySpark DataFrame that looks like this:
serial_number
000001234
000002887
00008765
0745-218
01-7865
040/7868L
0000124
00002364
01231325246
068775H
I want to keep only the records that start with a single 0 and that are not purely numeric, i.e. they should contain alphabetic and/or special characters in addition to digits. So I want to keep only:
serial_number
0745-218
01-7865
040/7868L
068775H
I tried some regular expressions like ^0[^0], but they also accept all-numeric entries.
Use rlike:
from pyspark.sql.functions import col
df.where(col('serial_number').rlike(r'\D') & col('serial_number').rlike('^0')).show()
Following Hà Nguyễn's answer:
import re
COMPILED = re.compile(r"0[^0]\d*[^\d]+\d*")
serial_numbers = [
"000001234",
"000002887",
"00008765",
"0745-218",
"01-7865",
"040/7868L",
"0000124",
"00002364",
"01231325246",
"068775H"
]
matching_numbers = [number for number in serial_numbers if COMPILED.match(number)]
print(matching_numbers)
There is no need for ^, as match only matches at the start of the string.
\d is practically shorthand for [0-9] (for str patterns it also matches other Unicode decimal digits unless re.ASCII is set).
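The two notes above can be checked with a small sketch. The sample strings are made up for illustration and are not part of the original data.

```python
import re

pattern = re.compile(r"0[^0]")

# match() only tries at position 0; search() scans the whole string.
anchored = pattern.match("x07")
scanned = pattern.search("x07")

# For str patterns, \d also matches non-ASCII decimal digits
# unless the re.ASCII flag is given.
unicode_digit = re.fullmatch(r"\d", "\u0663")  # ARABIC-INDIC DIGIT THREE
ascii_only = re.fullmatch(r"\d", "\u0663", re.ASCII)

print(anchored, scanned, unicode_digit, ascii_only)
```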
import re
serial_numbers = [
"000001234",
"000002887",
"00008765",
"0745-218",
"01-7865",
"040/7868L",
"0000124",
"00002364",
"01231325246",
"068775H"
]
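# a single leading 0 (not followed by another 0), then optional digits, then at least one non-digit
pattern = r"^0(?!0)\d*\D"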
matching_numbers = [number for number in serial_numbers if re.match(pattern, number)]
print(matching_numbers)

Select String after string with regex in python

Imagine that we have a string like:
Routing for Networks:
0.0.0.0/32
5.6.4.3/24
2.3.1.4/32
Routing Information Sources:
Gateway Distance Last Update
192.168.61.100 90 00:33:51
192.168.61.103 90 00:33:43
Irregular IPs:
1.2.3.4/24
5.4.3.3/24
I need to get a list of IPs between "Routing for Networks:" and "Routing Information Sources:" like below:
['0.0.0.0/32', '5.6.4.3/24', '2.3.1.4/32']
What I have so far is:
Routing for Networks:\n(.+(?:\n.+)*)\nRouting
But it is not working as expected.
UPDATE:
My code is as below:
re.findall("Routing for Networks:\n(.+(?:\n.+)*)\nRouting", string)
The value of capture group 1 includes the newlines. You can split the value of capture group 1 on newlines to get the separate values.
If you use re.findall, you will get a list of group 1 values, and you can split every value in that list on newlines.
An example with a single group 1 match:
import re
pattern = r"Routing for Networks:\n(.+(?:\n.+)*)\nRouting"
s = ("Routing for Networks:\n"
"0.0.0.0/32\n"
"5.6.4.3/24\n"
"2.3.1.4/32\n"
"Routing Information Sources:\n"
"Gateway Distance Last Update\n"
"192.168.61.100 90 00:33:51\n"
"192.168.61.103 90 00:33:43")
m = re.search(pattern, s)
if m:
    print(m.group(1).split("\n"))
Output
['0.0.0.0/32', '5.6.4.3/24', '2.3.1.4/32']
For a bit more precise match, and if there can be multiple of the same consecutive parts, you can match the format and use an assertion for Routing instead of a match:
Routing for Networks:\n((?:(?:\d{1,3}\.){3}\d{1,3}/\d+\n)+)(?=Routing)
Example
pattern = r"Routing for Networks:\n((?:(?:\d{1,3}\.){3}\d{1,3}/\d+\n)+)(?=Routing)"
s = "..."
m = re.search(pattern, s)
if m:
    print([line for line in m.group(1).split("\n") if line])
See a regex demo and a Python demo.
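For the re.findall case mentioned above, here is a minimal sketch using the stricter pattern. The input text is invented for illustration and assumes the "Routing for Networks:" header can occur more than once.

```python
import re

pattern = r"Routing for Networks:\n((?:(?:\d{1,3}\.){3}\d{1,3}/\d+\n)+)(?=Routing)"

# Hypothetical input with two "Routing for Networks:" blocks.
text = (
    "Routing for Networks:\n"
    "0.0.0.0/32\n"
    "5.6.4.3/24\n"
    "Routing Information Sources:\n"
    "192.168.61.100 90 00:33:51\n"
    "Routing for Networks:\n"
    "2.3.1.4/32\n"
    "Routing Information Sources:\n"
)

# findall returns one group 1 string per block; splitlines() turns each
# block into a list of addresses without a trailing empty item.
blocks = [block.splitlines() for block in re.findall(pattern, text)]
print(blocks)
```

The stricter pattern is used here because the greedy (.+(?:\n.+)*) version would run from the first header to the last "\nRouting" when several blocks are present.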

regular expressions in python for searching file

I want to know how to get files which match this type:
recording_i.file_extension
Ex:
recording_1.mp4
recording_112.mp4
recording_11.mov
I have a regular expression:
(recording_\d*)(\..*)
My regular expression doesn't work as I want.
Examples of wrong file names that should not match: lalala_recording_1.mp4, recording_.mp4
But my regex matches these examples, whereas my code should return [] for them.
Can you fix my regular expression, please?
Thanks.
Use
(^recording_\d+)(\.\w{3}$)
Test
import re
s = """recording_1.mp4
recording_112.mp4
recording_11.mov
lalala_recording_1.mp4,
recording_.mp4"""
pattern = re.compile(r"(^recording_\d+)(\.\w{3}$)")
for l in s.split():
    if pattern.match(l):
        print(l)
Output (only the desired files)
recording_1.mp4
recording_112.mp4
recording_11.mov
Explanation
With r"(^recording_\d+)(\.\w{3}$)":
- \d+ because at least one digit is required
- \w{3} for a three-character suffix
- ^ to ensure the name starts with recording
- $ to ensure the name ends after the suffix
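To see what each part of the pattern contributes, the loose and the anchored patterns can be compared side by side. This sketch assumes the pattern from the question was applied with re.search; the names loose, strict, and results are made up for the example.

```python
import re

loose = re.compile(r"(recording_\d*)(\..*)")        # pattern from the question
strict = re.compile(r"(^recording_\d+)(\.\w{3}$)")  # anchored pattern

names = ["recording_1.mp4", "lalala_recording_1.mp4", "recording_.mp4"]

# Map each name to (matched by loose pattern, matched by strict pattern).
results = {
    name: (bool(loose.search(name)), bool(strict.search(name)))
    for name in names
}
for name, flags in results.items():
    print(name, flags)
```

Without ^ the prefix lalala_ is skipped over, and \d* accepts an empty run of digits, which is why both bad names slip through the loose pattern.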
Particular Suffixes
import re
# List of suffixes to match
suffixes_list = ['mp4', 'mov']
suffixes = '|'.join(suffixes_list)
# Use the suffixes in the pattern (rather than accepting
# any three-character suffix); the (?:...) group keeps the
# alternation from splitting the rest of the pattern
pattern = re.compile(fr"(^recording_\d+)(\.(?:{suffixes})$)")
Test
s = """recording_1.mp4
recording_112.mp4
recording_11.mov
lalala_recording_1.mp4,
recording_.mp4
dummy1.exe
dummy2.pdf
dummy3.exe"""
for l in s.split():
    if pattern.match(l):
        print(l)
Output
recording_1.mp4
recording_112.mp4
recording_11.mov
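A possible variant of the same idea, sketched under the assumption that the file names are already available as a list: re.fullmatch makes the ^ and $ anchors unnecessary, and re.escape guards against suffixes that contain regex metacharacters. The helper recording_pattern is a name invented for this example.

```python
import re

def recording_pattern(suffixes):
    """Build a pattern for recording_<digits>.<one of the suffixes>."""
    alternatives = "|".join(re.escape(suffix) for suffix in suffixes)
    return re.compile(rf"recording_\d+\.(?:{alternatives})")

pattern = recording_pattern(["mp4", "mov"])

names = [
    "recording_1.mp4",
    "recording_11.mov",
    "lalala_recording_1.mp4",
    "recording_.mp4",
    "recording_1.mp4x",
    "recording_2.exe",
]

# fullmatch succeeds only if the whole name fits the pattern.
matching = [name for name in names if pattern.fullmatch(name)]
print(matching)
```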

Filtering a list of strings using regex

I have a list of strings that looks like this,
strlist = [
'list/category/22',
'list/category/22561',
'list/category/3361b',
'list/category/22?=1512',
'list/category/216?=591jf1!',
'list/other/1671',
'list/1y9jj9/1yj32y',
'list/category/91121/91251',
'list/category/0027',
]
I want to use regex to find the strings in this list that consist of list/category/ followed by an integer of any length and nothing else: no letters or symbols may follow it.
So in my example, the output should look like this
list/category/22
list/category/22561
list/category/0027
I used the following code:
import re
newlist = []
for i in strlist:
    if re.match('list/category/[0-9]+[0-9]', i):
        newlist.append(i)
        print(i)
but this is my output:
list/category/22
list/category/22561
list/category/3361b
list/category/22?=1512
list/category/216?=591jf1!
list/category/91121/91251
list/category/0027
How do I fix my regex? And also is there a way to do this in one line using a filter or match command instead of a for loop?
You can try the below regex:
^list\/category\/\d+$
Explanation of the above regex:
^ - Represents the start of the given test String.
\d+ - Matches digits that occur one or more times.
$ - Matches the end of the test string. This is the part your regex missed.
Demo of the above regex in here.
IMPLEMENTATION IN PYTHON
import re
pattern = re.compile(r"^list\/category\/\d+$", re.MULTILINE)
match = pattern.findall("list/category/22\n"
"list/category/22561\n"
"list/category/3361b\n"
"list/category/22?=1512\n"
"list/category/216?=591jf1!\n"
"list/other/1671\n"
"list/1y9jj9/1yj32y\n"
"list/category/91121/91251\n"
"list/category/0027")
print(match)
You can find the sample run of the above implementation here.
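The question also asks for a one-line version that works on the list directly. A minimal sketch, assuming the strlist from the question; re.fullmatch stands in for the ^ and $ anchors.

```python
import re

strlist = [
    'list/category/22',
    'list/category/22561',
    'list/category/3361b',
    'list/category/22?=1512',
    'list/category/216?=591jf1!',
    'list/other/1671',
    'list/1y9jj9/1yj32y',
    'list/category/91121/91251',
    'list/category/0027',
]

pattern = re.compile(r"list/category/\d+")

# List comprehension: keep the strings that match from start to end.
newlist = [s for s in strlist if pattern.fullmatch(s)]

# The same filter written with the built-in filter().
filtered = list(filter(pattern.fullmatch, strlist))

print(newlist)
```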

how to get a pattern repeating multiple times in a string using regular expression

I am still new to regular expressions, as in the Python library re.
I want to extract all the proper nouns, joining consecutive ones (separated by a space) into a single string.
I tried
result = re.findall(r'(\w+)\w*/NNP (\w+)\w*/NNP', tagged_sent_str)
Input: I have a string like
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
Output expected:
[('European Community'), ('European')]
Current output:
[('European','Community')]
But this only gives the pairs, not the single ones. I want both kinds.
IIUC, itertools.groupby is more suited for this kind of job:
from itertools import groupby
def join_token(string_, type_='NNP'):
    res = []
    for k, g in groupby([i.split('/') for i in string_.split()], key=lambda x: x[1]):
        if k == type_:
            res.append(' '.join(i[0] for i in g))
    return res
join_token(tagged_sent_str)
Output:
['European Community', 'European']
and it doesn't require a modification if you expect three or more consecutive types:
str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"
join_token(str2)
Output:
['European Community Union', 'European']
Interesting requirement. Here is a very fast solution using only regex; the code is explained in the comments:
import re
# a more complex example
text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP"
# 1: First clean up the target words, turning word/NNP into word.
# str.replace would work too; re.sub is used here to show how
# to refer back to a captured group with \index_of_group
# re.sub(r'/NNP', '', text)
# text.replace('/NNP', '')
_text = re.sub(r'(\w+)/NNP', r'\1', text)
# this pattern also skips the leading and trailing spaces
RE_FIND_ALL = r'(?:\s+|^)((?:(?:\s|^)?\w+(?=\s+|$)?)+)(?:\s+|$)'
print('RESULT : ', re.findall(RE_FIND_ALL, _text))
OUTPUT:
RESULT : ['European0', 'European1 Community1 Community2', 'European2', 'European2']
Explaining the regex:
(?:\s+|^) : skips the leading spaces
((?:(?:\s)?\w+(?=\s+|$))+) : a capturing group around a repeated non-capturing subgroup (?:(?:\s)?\w+(?=\s+|$)). The subgroup matches each word that is followed by spaces or the end of the line, and the outer group captures the whole sequence. Without the outer group, the match would return only the first word.
(?:\s+|$) : consumes the trailing space of the sequence
I needed to remove /NNP from the target words because you want to keep a run of word/NNP tokens as a single string. Something like (word)/NNP (word)/NNP returns two elements in one group, not a single text. After removing the tag the text becomes word word, so a regex like ((?:\w+\s)+) can capture the run of words. It is not quite as simple as that, because only the words that no longer end in /sequence_of_letters should be captured. This way there is no need to loop over the matched groups and concatenate them to build the text.
NOTE: both solutions work fine if all words are in the format word/sequence_of_letters. If you have words that are not in this format,
you need to fix them first: to keep them, add /NNP at the end of each such word; otherwise add /DUMMY so that they are removed.
Using re.split, which is slower because a list comprehension is needed to clean up the result:
import re
# make it more complex
text = "export1/VB Europian0/NNP export/VB Europian1/NNP Community1/NNP Community2/NNP French/JJ Europian2/NNP export/VB Europian2/NNP export/VB export/VB"
RE_SPLIT = r'\w+/[^N]\w+'
result = [x.replace('/NNP', '').strip() for x in re.split(RE_SPLIT, text) if x.strip()]
print('RESULT: ', result)
You'd like to get a pattern but with some parts deleted from it.
You can get it with two successive regexes:
import re
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
print([re.sub(r"/NNP", "", s) for s in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*", tagged_sent_str)])
Output:
['European Community', 'European']
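The two-step approach can be wrapped in a small function that takes the tag as a parameter. This is only a sketch: join_tagged is a name invented for this example, and the (?!\w) check is an added assumption so that a longer tag such as NNPS is not mistaken for NNP.

```python
import re

def join_tagged(text, tag="NNP"):
    """Return each run of consecutive word/<tag> tokens as one string."""
    tag_re = re.escape(tag)
    token = rf"\w+/{tag_re}(?!\w)"  # one tagged word, tag must end here
    run = re.compile(rf"{token}(?:\s+{token})*")
    # Step 1: find the runs. Step 2: strip the tag from every token in a run.
    return [re.sub(rf"/{tag_re}(?!\w)", "", m) for m in run.findall(text)]

tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"

print(join_tagged(tagged_sent_str))
print(join_tagged(str2))
print(join_tagged(tagged_sent_str, tag="JJ"))
```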
