Regex expression to strip ending of word - python

I have the following identifiers:
id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'
I need a regex to get me the following output:
id1 = '883316040119'
id2 = 'ZWEX01DE9463DB'
id3 = '35358'
id4 = 'as3d99j'
Here is what I have so far --
re.sub(r'_?([a-zA-Z]{2,4}?\d?(00\d)?)$','',vendor_id)
It doesn't work perfectly though, here is what it gives me:
BAD - 883316040119_FRIENDS
GOOD - ZWEX01DE9463DB
GOOD - 35358
GOOD - as3d99j
What would be the correct regular expression to get all of them? For the first one, I basically want to strip the ending if it is only underscores and letters, so 1928h9829_bundle_hd --> 1928h9829.
Please note that I have hundreds of thousands of identifiers here, and it is required that I use a regular expression. I'm not looking for a python split() way to do it, as it wouldn't work.

The way you present your input, I would suggest this simple regex:
^(?:[^_]+(?=_)|\d+)
This can be tweaked if you want to add details to the spec.
To show you a regex demo, just because of the way the site regex101 works, we have to add \n (it assumes we are working on the whole file, rather than one input at a time): DEMO
Explanation
The ^ anchor asserts that we are at the beginning of the string
The non-capture group (?: ... ) matches either
[^_]+(?=_) non-underscore characters (followed by an underscore, not matched)
| OR
\d+ digits
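As a quick check of the pattern explained above, here is a minimal sketch that applies it with re.match to the four identifiers from the question plus the 1928h9829_bundle_hd example. The names PREFIX and strip_suffix are illustrative and not part of the original answer, and returning the input unchanged when nothing matches is an assumption.

```python
import re

# Compile once so the pattern can be reused across many identifiers.
PREFIX = re.compile(r"^(?:[^_]+(?=_)|\d+)")

def strip_suffix(vendor_id):
    """Return the leading part of vendor_id selected by PREFIX."""
    m = PREFIX.match(vendor_id)
    # Assumption: leave the identifier untouched when nothing matches.
    return m.group(0) if m else vendor_id

ids = [
    "883316040119_FRIENDS_HD",
    "ZWEX01DE9463DB_DMD",
    "35358fr1",
    "as3d99j_br001",
    "1928h9829_bundle_hd",
]
results = [strip_suffix(i) for i in ids]
print(results)
```

Because this extracts the wanted prefix instead of deleting the tail, an identifier with no underscore and no leading digits would not match at all, which is why the sketch needs a fallback.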

This works for the examples:
for id in ids:
    print(id)
883316040119_FRIENDS_HD
ZWEX01DE9463DB_DMD
35358fr1
as3d99j_br001
for id in ids:
    hit = re.sub(r"(_[A-Za-z_]*|_?[A-Za-z]{2,4}?\d?(00\d)?)$", "", id)
    print(hit)
883316040119
ZWEX01DE9463DB
35358
as3d99j
When the tail contains only letters and underscores, the pattern is easygoing and strips off any number of underscores and letters; if the tail does not contain an underscore, or contains digits after the underscore, it demands the pattern in the question: an optional underscore, 2 to 4 letters, then an optional digit, then an optional zero-zero-digit.

You are checking for underscore only one possible time, as ? means {0,1}.
r'(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$'
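A minimal sketch of how this pattern could be applied with re.sub to the identifiers from the question. The names TAIL and stripped are illustrative only.

```python
import re

# The + after the group lets several "_letters" chunks be stripped in one pass,
# which is what the single optional underscore in the question could not do.
TAIL = re.compile(r"(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$")

ids = [
    "883316040119_FRIENDS_HD",
    "ZWEX01DE9463DB_DMD",
    "35358fr1",
    "as3d99j_br001",
]
stripped = [TAIL.sub("", i) for i in ids]
print(stripped)
```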

The following reproduces your desired results from your input.
I would use the replace method with this regex:
_[^']+|(?!.*_)('[0-9]+)[^']+
and return capturing group 1
Perhaps:
result = re.sub("_[^']+|(?!.*_)('[0-9]+)[^']+", r"\1", subject)
The regex first looks for an underscore. If it finds one, it will match everything up to but not including the next single quote; and that will get removed.
If that doesn't match, the alternative will look for a string that does NOT have an underscore; match and return in capturing group 1 the sequence of digits; and then replace everything after the digits up to but not including the single quote.
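Note that this pattern relies on the single quotes being part of the subject text. The sketch below assumes each subject is a whole line such as id1 = '883316040119_FRIENDS_HD', quotes included; the names PATTERN, subjects and cleaned are illustrative.

```python
import re

PATTERN = re.compile(r"_[^']+|(?!.*_)('[0-9]+)[^']+")

# Assumption: the quotes are literally present in the text being processed.
subjects = [
    "id1 = '883316040119_FRIENDS_HD'",
    "id2 = 'ZWEX01DE9463DB_DMD'",
    "id3 = '35358fr1'",
    "id4 = 'as3d99j_br001'",
]

# When the first alternative matches, group 1 did not participate and \1
# expands to an empty string (Python 3.5 and later).
cleaned = [PATTERN.sub(r"\1", s) for s in subjects]
for line in cleaned:
    print(line)
```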

This is not a subtraction approach; it just captures the matched string.
The regex is (^[0-9]+)|(^[a-zA-Z0-9]+(?=_)), i.e. roughly (^\d+)|(^[\d\w]+(?=_)), except that \w also matches an underscore.
import re
id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'
ids = [id1, id2, id3, id4]
for i in ids:
    m = re.match(r"(^[0-9]+)|(^[a-zA-Z0-9]+(?=_))", i)
    # re.match returns None when nothing matches
    print(m.group() if m else "not matched")
output:
883316040119
ZWEX01DE9463DB
35358
as3d99j

Related

split string based on pattern python

I am trying to delete a pattern off my string and only bring back the word I want to store.
example return
2022_09_21_PTE_Vendor PTE
2022_09_21_SSS_01_Vendor SSS_01
2022_09_21_OOS_market OOS
what I tried
fileName = "2022_09_21_PTE_Vendor"
newFileName = fileName.strip(re.split('[0-9]','_Vendor.xlsx'))
With Python's re module, try the following code, which uses its sub function; it was written and tested in Python 3 with the shown samples.
import re
fileName = "2022_09_21_PTE_Vendor"
re.sub(r'^\d{4}(?:_\d{2}){2}_(.*?)_.+$', r'\1', fileName)
'PTE'
Explanation: Adding detailed explanation for used regex.
^\d{4} ##From starting of the value matching 4 digits here.
(?: ##opening a non-capturing group here.
_\d{2} ##Matching underscore followed by 2 digits
){2} ##Closing non-capturing group and matching its 2 occurrences.
_ ##Matching only underscore here.
(.*?) ##Creating capturing group here where using lazy match concept to get values before next mentioned character.
_.+$ ##Matching _ till end of the value here.
Use a regular expression replacement, not split.
newFileName = re.sub(r'^\d{4}_\d{2}_\d{2}_(.+)_[^_]+$', r'\1', fileName)
^\d{4}_\d{2}_\d{2}_ matches the date at the beginning. [^_]+$ matches the part after the last _. And (.+) captures everything between them, which is copied to the replacement with \1.
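To see how the two substitutions above differ, here is a small sketch that runs both on the three sample names from the question. The names LAZY and GREEDY are labels chosen for this illustration.

```python
import re

# First answer: lazy (.*?) stops at the first underscore after the date.
LAZY = re.compile(r"^\d{4}(?:_\d{2}){2}_(.*?)_.+$")
# Second answer: greedy (.+) runs up to the last underscore.
GREEDY = re.compile(r"^\d{4}_\d{2}_\d{2}_(.+)_[^_]+$")

names = [
    "2022_09_21_PTE_Vendor",
    "2022_09_21_SSS_01_Vendor",
    "2022_09_21_OOS_market",
]
lazy_results = [LAZY.sub(r"\1", n) for n in names]
greedy_results = [GREEDY.sub(r"\1", n) for n in names]
print(lazy_results)
print(greedy_results)
```

Only the greedy version keeps SSS_01 whole for the second name; the lazy version returns SSS there, so the choice depends on whether the wanted part can itself contain underscores.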
Assuming that the date characters at the beginning are always "YYYY_MM_DD" you could do something like this:
fileName = "2022_09_21_SSS_01_Vendor"
fileName = fileName.lstrip()[11:]  # Removes the date portion
fileName = fileName.rstrip()[:fileName.rfind('_')]  # Finds the last underscore and removes underscore to end
print(fileName)
This should work:
newFileName = fileName[11:].rsplit("_", 1)[0]

Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.
Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com
You need to use re.search, since re.match only tries to match from the beginning of the string. To match until a space or period is encountered:
re.search(r'(?<=test :)[^.\s]*',text)
To match all the chars until a period is encountered,
re.search(r'(?<=test :)[^.]*',text)
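The following sketch shows what these calls return for the sample text from the question. One detail worth checking is the space that follows "test :" in the sample. The last two patterns, which consume that whitespace with \s* and use a capture group, are an assumption about the intended result and not part of the answer above.

```python
import re

text = "test : match this."

# re.match only tries position 0, where the lookbehind cannot succeed.
anchored = re.match(r"(?<=test :).*", text)

# re.search scans forward, but the character right after "test :" is a space,
# so [^.\s]* matches the empty string at that position.
empty = re.search(r"(?<=test :)[^.\s]*", text).group()

# [^.]* runs up to the period and keeps the leading space.
with_space = re.search(r"(?<=test :)[^.]*", text).group()

# Assumption: skip the whitespace around the colon and capture what follows.
up_to_period = re.search(r"test\s*:\s*([^.]*)", text).group(1)
first_word = re.search(r"test\s*:\s*([^.\s]*)", text).group(1)

print(repr(anchored), repr(empty), repr(with_space))
print(repr(up_to_period), repr(first_word))
```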
In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will return match this. (including the final period). See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match(), since it will only look for a match at the beginning of the string (Avinash already pointed that out, but it is a very important note!)
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
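A short sketch of both points in that note: what "\b" means with and without the r prefix, and how the word boundary changes matching. The second sample string and the helper first_group are made up for this illustration.

```python
import re

# Without the r prefix, "\b" is the single backspace character (0x08).
is_backspace = ("\b" == "\x08")
# With the r prefix, the regex engine receives a backslash followed by "b".
is_two_chars = (len(r"\b") == 2)

loose = re.compile(r"test\s*:\s*(.*)\.")
strict = re.compile(r"\btest\s*:\s*(.*)\.")

def first_group(pattern, s):
    """Return group 1 of the first match, or None (illustrative helper)."""
    m = pattern.search(s)
    return m.group(1) if m else None

s1 = "test : match this."
s2 = "contest : not wanted."  # hypothetical input where "test" is part of a word

loose_hits = [first_group(loose, s) for s in (s1, s2)]
strict_hits = [first_group(strict, s) for s in (s1, s2)]
print(loose_hits, strict_hits)
```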
I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
    print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, awkward to design, and difficult to debug. There are definitely occasions to use it, but if you just want to extract the text between test: and ., then I don't think this is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).
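The generalisation described above could be sketched roughly as follows. The function name text_after_prefix, the stop parameter and the whitespace stripping are assumptions added for illustration.

```python
def text_after_prefix(line, prefix, stop="."):
    """Return the text between prefix and the first stop character.

    Returns None when line does not start with prefix.
    """
    if not line.startswith(prefix):
        return None
    rest = line[len(prefix):]          # index past the prefix, whatever its length
    end = rest.find(stop)
    if end == -1:                      # no stop character: keep the remainder
        end = len(rest)
    # Assumption: surrounding whitespace is not wanted.
    return rest[:end].strip()

prefixes = ["test:", "result:"]        # hypothetical list of prefixes to look for
line = "test: match this."
hits = [text_after_prefix(line, p) for p in prefixes]
print(hits)
```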

Using Python Regex to find a phrase between 2 tags

I've got a string that I want to use regex to find the characters encapsulated between two known patterns, "Cp_6%3A" then some characters then "&" and potentially more characters, or no & and just the end of string.
My code looks like this:
def extract_id_from_ref(ref):
    id = re.search("Cp\_6\%3A(.*?)(\& | $)", ref)
    print(id)
But this isn't producing anything, Any ideas?
Thanks in advance
Note that (\& | $) matches either the & char and a space after it, or a space and end of string (the spaces are meaningful here!).
Use a negated character class [^&]* (zero or more chars other than &) to simplify the regex (no need for an alternation group or lazy dot matching pattern) and then access .group(1):
def extract_id_from_ref(ref):
    m = re.search(r"Cp_6%3A([^&]*)", ref)
    if m:
        print(m.group(1))
Note that neither _ nor % are special regex metacharacters, and do not have to be escaped.
See the regex demo.
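A minimal sketch of the negated character class on both cases from the question (an & after the value, or end of string). The sample strings are invented, and this variant returns the value instead of printing it.

```python
import re

def extract_id_from_ref(ref):
    """Return the text after Cp_6%3A up to the next & or the end, else None."""
    m = re.search(r"Cp_6%3A([^&]*)", ref)
    return m.group(1) if m else None

# Hypothetical inputs.
with_amp = extract_id_from_ref("page?Cp_6%3Aabc123&lang=en")
at_end = extract_id_from_ref("page?Cp_6%3Aabc123")
missing = extract_id_from_ref("page?lang=en")

# The pattern from the question needs "& " or " " plus end of string,
# so it finds nothing in the same input.
original = re.search(r"Cp\_6\%3A(.*?)(\& | $)", "page?Cp_6%3Aabc123&lang=en")

print(with_amp, at_end, missing, original)
```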
The problem is that spaces in a regex pattern are also taken into account. Furthermore, in order to add a backslash to the string, you either have to write \\ (two backslashes) or use a raw string:
So you should write:
r"Cp_6\%3A(.*?)(?:\&|$)"
If you then match with:
def extract_id_from_ref(ref):
    id = re.search(r"Cp_6\%3A(.*?)(?:\&|$)", ref)
    print(id)
It should work.

Searching for multiple substrings of unknown size in string in python

I've seen lots of RE stuff in python but nothing for the exact case and I can't seem to get it. I have a list of files with names that look like this:
summary_Cells_a_01_2_1_45000_it_1.txt
summary_Cells_a_01_2_1_40000_it_2.txt
summary_Cells_bb_01_2_1_36000_it_3.txt
The "summary_Cells_" is always present. Then there is a string of letters, either 1, 2 or 3 long. Then there is "_01_2_1_" always. Then there is a number between 400 and 45000. Then there is "it" and then a number from 0-9, then ".txt"
I need to extract the letter(s) piece.
I was trying:
match = re.search('summary_Cells_(\w)_01_2_1_(\w)_it_(\w).txt', filename)
but was not getting anything for the match. I'm trying to get just the letters, but later might want the it number (last number) or the step (the middle number).
Any ideas?
Thanks
You're missing repetitions, i.e.:
re.search(r'summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt', filename)
\w will only match a single character
\w+ will match at least one
\w* will match any amount (0 or more)
Reference: Regular expression syntax
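The difference between \w and \w+ can be checked on one of the file names from the question. This is only a sketch; the variable names are illustrative, and the dot before txt is escaped here so that it matches a literal period.

```python
import re

filename = "summary_Cells_a_01_2_1_45000_it_1.txt"

# (\w) matches exactly one character, so it cannot cover "45000".
single = re.search(r"summary_Cells_(\w)_01_2_1_(\w)_it_(\w)\.txt", filename)

# (\w+) matches one or more characters.
repeated = re.search(r"summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt", filename)

print(single)
print(repeated.groups())
```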
You were almost there. All you need to do is repeat the \w inside each capture group:
summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+).txt
Example usage
>>> filename="summary_Cells_a_01_2_1_45000_it_1.txt"
>>> match = re.search(r'summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+).txt', filename)
>>> match.group()
'summary_Cells_a_01_2_1_45000_it_1.txt'
>>> match.group(0)
'summary_Cells_a_01_2_1_45000_it_1.txt'
>>> match.group(1)
'a'
>>> match.group(2)
'45000'
>>> match.group(3)
'1'
Note
The match.group(n) will return the value captured by the nth capture group
You don't need a regex, there is nothing complex about the pattern and it does not change:
s = "summary_Cells_a_01_2_1_45000_it_1.txt"
print(s.split("_")[2])
a
s = "summary_Cells_bb_01_2_1_36000_it_3.txt"
print(s.split("_")[2])
bb
If you want both sets of letters:
s = "summary_Cells_bb_01_2_1_36000_it_3.txt"
spl = s.split("_")
a,b = spl[2],spl[7]
print(a,b)
bb it
Since you only want to capture the letters at the beginning, you could do:
re.search(r'summary_Cells_(\w+)_01_2_1_[0-9]{3,6}_it_[0-9]\.txt', filename)
Which doesn't bother giving you the groups you don't need.
[0-9] matches a single digit and [0-9]{3,6} allows for 3 to 6 digits.
You're on the right track with your regex, but as everyone else forgets, \w includes alphanumerics and the underscore, so you should use [a-z] instead.
re.search(r"summary_Cells_([a-z]+)_\w+\.txt", filename)
Or, as Padraic mentioned, you can just use str.split("_").
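Since the question mentions possibly needing the step and the iteration number later, a stricter pattern with named groups is one option. This is a sketch under assumptions taken from the question's description: lowercase letters only, 1 to 3 of them, a 3 to 5 digit step, and a single-digit iteration. The group names are invented here.

```python
import re

FILENAME = re.compile(
    r"summary_Cells_(?P<letters>[a-z]{1,3})"   # 1 to 3 lowercase letters (assumed)
    r"_01_2_1_(?P<step>\d{3,5})"               # step between 400 and 45000
    r"_it_(?P<it>\d)\.txt"                     # single-digit iteration
)

files = [
    "summary_Cells_a_01_2_1_45000_it_1.txt",
    "summary_Cells_a_01_2_1_40000_it_2.txt",
    "summary_Cells_bb_01_2_1_36000_it_3.txt",
]
# fullmatch requires the whole name to fit the pattern.
parsed = [FILENAME.fullmatch(f).groupdict() for f in files]
for p in parsed:
    print(p["letters"], p["step"], p["it"])
```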

Python re.finditer match.groups() does not contain all groups from match

I am trying to use regex in Python to find and print all matching lines from a multiline search.
The text that I am searching through may have the below example structure:
AAA
ABC1
ABC2
ABC3
AAA
ABC1
ABC2
ABC3
ABC4
ABC
AAA
ABC1
AAA
From which I want to retrieve the ABC*s that occur at least once and are preceded by an AAA.
The problem is, that despite the group catching what I want:
match = <_sre.SRE_Match object; span=(19, 38), match='AAA\nABC2\nABC3\nABC4\n'>
... I can access only the last match of the group:
match groups = ('AAA\n', 'ABC4\n')
Below is the example code that I use for this problem.
#! python
import sys
import re
import os
string = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"
print(string)
p_MATCHES = []
p_MATCHES.append( (re.compile('(AAA\n)(ABC[0-9]\n){1,}')) ) #
matches = re.finditer(p_MATCHES[0],string)
for match in matches:
    strout = ''
    gr_iter=0
    print("match = "+str(match))
    print("match groups = "+str(match.groups()))
    for group in match.groups():
        gr_iter+=1
        sys.stdout.write("TEST GROUP:"+str(gr_iter)+"\t"+group) # test output
        if group is not None:
            if group != '':
                strout+= '"'+group.replace("\n","",1)+'"'+'\n'
    sys.stdout.write("\nCOMPLETE RESULT:\n"+strout+"====\n")
Here is your regular expression:
(AAA\r\n)(ABC[0-9]\r\n){1,}
Debuggex Demo
Your goal is to capture all ABC#s that immediately follow AAA. As you can see in this Debuggex demo, all ABC#s are indeed being matched (they're highlighted in yellow). However, since only the "what is being repeated" part
ABC[0-9]\r\n
is being captured (is inside the parentheses), and its quantifier,
{1,}
is not being captured, this therefore causes all matches except the final one to be discarded. To get them, you must also capture the quantifier:
AAA\r\n((?:ABC[0-9]\r\n){1,})
Debuggex Demo
I've placed the "what is being repeated" part (ABC[0-9]\r\n) into a non-capturing group. (I've also stopped capturing AAA, as you don't seem to need it.)
The captured text can be split on the newline, and will give you all the pieces as you wish.
(Note that \n by itself doesn't work in Debuggex. It requires \r\n.)
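In Python, with plain \n line endings as in the question's test string, the difference between capturing the repeated part and capturing the whole repetition can be sketched like this. The variable names last_only and blocks are illustrative.

```python
import re

string = ("AAA\nABC1\nABC2\nABC3\n"
          "AAA\nABC1\nABC2\nABC3\nABC4\nABC\n"
          "AAA\nABC1\nAAA\n")

# Capturing only the repeated part keeps just its final repetition.
last_only = re.search(r"(AAA\n)(ABC[0-9]\n){1,}", string).groups()

# Capturing the whole repetition keeps every ABC line of each block,
# which can then be split on the newline.
blocks = [m.group(1).splitlines()
          for m in re.finditer(r"AAA\n((?:ABC[0-9]\n){1,})", string)]

print(last_only)
print(blocks)
```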
This is a workaround. Not many regular expression flavors offer the capability of iterating through repeating captures (which ones...?). A more normal approach is to loop through and process each match as they are found. Here's an example from Java:
import java.util.regex.*;
public class RepeatingCaptureGroupsDemo {
    public static void main(String[] args) {
        String input = "I have a cat, but I like my dog better.";
        Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}
Output:
cat
dog
(From http://ocpsoft.org/opensource/guide-to-regular-expressions-in-java-part-1/, about a 1/4 down)
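Since the question is about Python, a rough equivalent of the Java loop above would use re.finditer, which yields one match object per occurrence. The variable names are illustrative.

```python
import re

text = "I have a cat, but I like my dog better."
animals = re.compile(r"mouse|cat|dog|wolf|bear|human")

# Process each match as it is found, like the while (m.find()) loop.
found = []
for m in animals.finditer(text):
    found.append(m.group())
    print(m.group())
```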
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. The links in this answer come from it.
You want the pattern of consecutive ABC\n occurring after a AAA\n in the most greedy way. You also want only the group of consecutive ABC\n and not a tuple of that and the most recent ABC\n. So in your regex, exclude the subgroup within the group.
Notice the pattern, write the pattern that represents the whole string.
AAA\n(ABC[0-9]\n)+
Then capture the one you are interested in with (), while remembering to exclude subgroup(s)
AAA\n((?:ABC[0-9]\n)+)
You can then use either findall() or finditer(). I find finditer() easier, especially when you are dealing with more than one capture.
finditer:-
import re
matches_iter = re.finditer(r'AAA\n((?:ABC[0-9]\n)+)', string)
for i in matches_iter:
    print(i.group(1))
findall, using the original {1,}, as it's a more verbose form of + :-
matches_all = re.findall(r'AAA\n((?:ABC[0-9]\n){1,})', string)
for y in matches_all:
    for x in y.split("\n"):
        print(x)
