split string based on pattern python

I am trying to strip a pattern from my string and keep only the word I want to store.
example                      return
2022_09_21_PTE_Vendor        PTE
2022_09_21_SSS_01_Vendor     SSS_01
2022_09_21_OOS_market        OOS
What I tried:
fileName = "2022_09_21_PTE_Vendor"
newFileName = fileName.strip(re.split('[0-9]','_Vendor.xlsx'))

With Python's re module, please try the following code, which uses the sub function; it was written and tested in Python 3 with the samples shown.
import re
fileName = "2022_09_21_PTE_Vendor"
re.sub(r'^\d{4}(?:_\d{2}){2}_(.*?)_.+$', r'\1', fileName)
'PTE'
Explanation: Adding detailed explanation for used regex.
^\d{4} ##From starting of the value matching 4 digits here.
(?: ##opening a non-capturing group here.
_\d{2} ##Matching underscore followed by 2 digits
){2} ##Closing non-capturing group and matching its 2 occurrences.
_ ##Matching only underscore here.
(.*?) ##Creating capturing group here where using lazy match concept to get values before next mentioned character.
_.+$ ##Matching _ till end of the value here.

Use a regular expression replacement, not split.
newFileName = re.sub(r'^\d{4}_\d{2}_\d{2}_(.+)_[^_]+$', r'\1', fileName)
^\d{4}_\d{2}_\d{2}_ matches the date at the beginning. [^_]+$ matches the part after the last _. And (.+) captures everything between them, which is copied to the replacement with \1.
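As a quick check of how the capture group behaves on all three sample names, here is a minimal sketch. The helper name extract_label is hypothetical and not part of either answer; it wraps the greedy pattern above and compares it with the lazy (.*?) pattern from the earlier answer, which stops at the first underscore after the date and so shortens SSS_01 to SSS.

```python
import re

# Greedy capture: everything between the date prefix and the last underscore.
GREEDY = re.compile(r'^\d{4}_\d{2}_\d{2}_(.+)_[^_]+$')
# Lazy capture (pattern from the earlier answer): stops at the first underscore after the date.
LAZY = re.compile(r'^\d{4}(?:_\d{2}){2}_(.*?)_.+$')

def extract_label(file_name, pattern=GREEDY):
    """Return the captured label, or None if the name does not fit the pattern (hypothetical helper)."""
    m = pattern.match(file_name)
    return m.group(1) if m else None

samples = ["2022_09_21_PTE_Vendor", "2022_09_21_SSS_01_Vendor", "2022_09_21_OOS_market"]
greedy_results = [extract_label(s) for s in samples]
lazy_results = [extract_label(s, LAZY) for s in samples]
print(greedy_results)  # ['PTE', 'SSS_01', 'OOS']
print(lazy_results)    # ['PTE', 'SSS', 'OOS']
```

Returning None for names that do not fit is a design choice of this sketch; re.sub, by contrast, hands back the input unchanged when nothing matches.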

Assuming that the date characters at the beginning are always "YYYY_MM_DD" you could do something like this:
fileName = "2022_09_21_SSS_01_Vendor"
fileName = fileName.lstrip()[11:]  # Removes the date portion
fileName = fileName.rstrip()[:fileName.rfind('_')]  # Finds the last underscore and removes everything from it to the end
print(fileName)

This should work (note the maxsplit of 1, so that only the last underscore is split on and SSS_01 stays intact):
newFileName = fileName[11:].rsplit("_", 1)[0]

Related

Regex to find N characters between underscore and period

I have a filename having numerals like test_20200331_2020041612345678.csv.
So I just want to read only first 8 characters from the number between last underscore and .csv using a regex.
For e.g: From the file name test_20200331_2020041612345678.csv --> i want to read only 20200416 using regex.
Regex tried: (?<=_)(\d+)(?=\.)
But it is returning the full number between underscore and period i.e 2020041612345678
Also, when I tried a quantifier, as in (?<=_)(\d{8})(?=\.), it did not match anything.

The (?<=_)(\d{8})(?=\.) does not work because the (?=\.) positive lookahead requires the presence of a . char immediately to the right of the current location, i.e. right after the eighth digit, but there are more digits in between.
You may add \d* before \. to match any number of digits after the required 8 digits:
(?<=_)\d{8}(?=\d*\.)
Or, with a capturing group, you do not even need lookarounds (just make sure you access Group 1 when a match is obtained):
_(\d{8})\d*\.
Python demo:
import re
s = "test_20200331_2020041612345678.csv"
m = re.search(r"(?<=_)\d{8}(?=\d*\.)", s)
# m = re.search(r"_(\d{8})\d*\.", s) # capturing group approach
if m:
    print(m.group()) # => 20200416
    # print(m.group(1)) # capturing group approach
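To see how the capturing-group variant behaves on a few inputs, a small sketch follows. The function name first_eight_digits and the two extra sample names are made up for illustration. Because the eight digits must be followed only by more digits and then a period, the earlier 20200331 field is skipped: it is followed by an underscore.

```python
import re

# Underscore, exactly eight captured digits, any further digits, then a literal period.
PATTERN = re.compile(r"_(\d{8})\d*\.")

def first_eight_digits(name):
    """Return the first 8 digits of the numeric field before the period, or None (hypothetical helper)."""
    m = PATTERN.search(name)
    return m.group(1) if m else None

print(first_eight_digits("test_20200331_2020041612345678.csv"))  # 20200416
print(first_eight_digits("a_12345678.csv"))                      # 12345678
print(first_eight_digits("report_1234.csv"))                     # None (fewer than 8 digits)
```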

Using Regex to search for a string unless it finds another string first

Hello I'm trying to use regex to search through a markdown file for a date and only get a match if it finds an instance of a specific string before it finds another date.
This is what I have right now and it definitely doesn't work.
(\d{2}\/\d{2}\/\d{2})(string)?(^(\d{2}\/\d{2}\/\d{2}))
So in this instance it should produce a match, since the string comes before the next date:
01/20/20
string
01/21/20
Here it shouldn't match since the string is after the next date:
01/20/20
this isn't the phrase you're looking for
01/21/20
string
Any help on this would be greatly appreciated.

You could match a date like pattern. Then use a tempered greedy token approach (?:(?!\d{2}\/\d{2}\/\d{2}).)* to match string without matching another date first.
If you have matched the string, use a non greedy dot .*? to match the first occurrence of the next date.
\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string.*?\d{2}\/\d{2}\/\d{2}
For example (using re.DOTALL to make the dot match a newline)
import re
regex = r"\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string(?:(?!string|\d{2}\/\d{2}\/\d{2}).)*\d{2}\/\d{2}\/\d{2}"
test_str = ("01/20/20\n\n"
    "string\n\n"
    "01/21/20\n\n"
    "01/20/20\n\n"
    "this isn't the phrase you're looking for\n\n"
    "01/21/20\n\n"
    "string")
print(re.findall(regex, test_str, re.DOTALL))
Output
['01/20/20\n\nstring\n\n01/21/20']
If the string must not occur twice between the dates, you might use
\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}|string).)*string(?:(?!string|\d{2}\/\d{2}\/\d{2}).)*\d{2}\/\d{2}\/\d{2}
Note that if you don't want the string and the dates to be part of a larger word, you could add word boundaries \b
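The effect of the word boundaries can be sketched as follows. The names DATE, loose and strict, and the sample text containing "substring", are assumptions made for this illustration; the patterns are the first regex above with and without \b added.

```python
import re

DATE = r"\d{2}/\d{2}/\d{2}"

# Without word boundaries: "string" may be part of a longer word.
loose = re.compile(rf"{DATE}(?:(?!{DATE}).)*string.*?{DATE}", re.DOTALL)
# With word boundaries around the dates and the string.
strict = re.compile(
    rf"\b{DATE}\b(?:(?!\b{DATE}\b).)*\bstring\b.*?\b{DATE}\b", re.DOTALL
)

text = "01/20/20\nsubstring\n01/21/20"
exact = "01/20/20\nstring\n01/21/20"

print(loose.search(text) is not None)   # True: "string" inside "substring" is accepted
print(strict.search(text) is not None)  # False: \bstring\b rejects "substring"
print(strict.search(exact) is not None) # True: a standalone "string" still matches
```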

One approach here would be to use a tempered dot to ensure that the regex engine does not cross over the ending date while trying to find the string after the starting date. For example:
import re
# The first "string" sits between the two dates and is matched;
# the trailing "string" comes after the ending date and is not.
inp = """01/20/20
string
01/21/20
01/20/20
01/21/20
string"""
matches = re.findall(r'01/20/20(?:(?!\b01/21/20\b).)*?(\bstring\b).*?\b01/21/20\b', inp, flags=re.DOTALL)
print(matches)  # ['string']
This prints string only once, that match being the first occurrence, which legitimately sits in between the starting and ending dates.

How to return whole non-latin strings matching a reduplication pattern, such as AAB or ABB

I am working with strings of non-latin characters.
I want to match strings with reduplication patterns, such as AAB, ABB, ABAB, etc.
I tried out the following code:
import re
patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.findall(rawtext)
print(match)
However, it returns only the first character of the matched string.
I know this happens because of the capturing parenthesis around the first \w.
I tried to add capturing parenthesis around the whole matched block, but Python gives
error: cannot refer to an open group at position 7
I also found this method, but it didn't work for me:
patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.search(rawtext)
if match:
    print(match.group(1))
How could I match the pattern and return the whole matching string?
# Ex. 哈哈笑
# string matches AAB pattern so my code returns 哈
# but not the entire string

The message:
error: cannot refer to an open group at position 7
is telling you that \1 refers to the group with parentheses all around, because its opening parenthesis comes first. The group you want to backreference is number 2, so this code works:
import re
rawtext = 'abc 哈哈笑 def'
patternAAB = re.compile(r'\b((\w)\2\w)\b')
match = patternAAB.findall(rawtext)
print(match)
Each item in match has both groups:
[('哈哈笑', '哈')]

"I also found this method, but didn't work for me:"
You were close here as well. You can use match.group(0) to get the full match, not just a group in parentheses. So this code works:
import re
rawtext = 'abc 哈哈笑 def'
patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.search(rawtext)
if match:
    print(match.group(0)) # 哈哈笑
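Since the question also mentions ABB and ABAB, the group(0) idea extends to several patterns with finditer, which yields match objects rather than group tuples. This is only a sketch: the ABB and ABAB patterns, the PATTERNS dict, the function name and the sample text are assumptions, not from the question. Note that these patterns do not require A and B to differ, so a string like AAA would count as both AAB and ABB.

```python
import re

# One compiled pattern per reduplication shape (ABB and ABAB are illustrative additions).
PATTERNS = {
    "AAB": re.compile(r"\b(\w)\1\w\b"),
    "ABB": re.compile(r"\b\w(\w)\1\b"),
    "ABAB": re.compile(r"\b(\w)(\w)\1\2\b"),
}

def find_reduplications(text):
    """Return whole matched strings per pattern, using group(0) of each match."""
    return {
        name: [m.group(0) for m in rx.finditer(text)]
        for name, rx in PATTERNS.items()
    }

rawtext = "abc 哈哈笑 笑哈哈 高兴高兴 def"
result = find_reduplications(rawtext)
print(result)  # {'AAB': ['哈哈笑'], 'ABB': ['笑哈哈'], 'ABAB': ['高兴高兴']}
```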

Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.

Everything after test, including test
test.*
Everything after test, without test
(?<=test).*

You need to use re.search, since re.match only tries to match at the beginning of the string. To match until a space or period is encountered (the lookbehind here includes the space after the colon, so that the match starts at the first word):
re.search(r'(?<=test : )[^.\s]*',text)
To match all the chars until a period is encountered:
re.search(r'(?<=test : )[^.]*',text)

In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
    print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will return "match this." with the trailing period included.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match(), since it only looks for a match at the beginning of the string (Avinash already pointed that out, but it is a very important note!).
A sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
    print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
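Both warnings above, re.match versus re.search and the r prefix, can be checked with a short sketch. It reuses the pattern from the question; the variable names are illustrative only.

```python
import re

text = "test : match this."
pattern = r"(?<=test :).*"

# re.match only tries position 0, where the lookbehind has nothing to look back at.
m_match = re.match(pattern, text)
# re.search scans forward and succeeds right after "test :".
m_search = re.search(pattern, text)
print(m_match)            # None
print(m_search.group(0))  # ' match this.' (note the leading space)

# Without the r prefix, '\b' is a BACKSPACE character, not a word boundary.
no_raw = re.search('\btest', text)
with_raw = re.search(r'\btest', text)
print(no_raw)             # None
print(with_raw.group(0))  # test
```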

I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
    print(line[5:line.find('.')].strip())
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')].strip())
match this
Regex is slow, awkward to design, and difficult to debug. There are definitely occasions to use it, but if you just want to extract the text between test: and ., then I don't think this is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).
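The len(...) generalization could look roughly like the sketch below. The function name text_after_prefix, its parameters and the second prefix "demo:" are hypothetical, chosen only for illustration.

```python
def text_after_prefix(line, prefixes, terminator="."):
    """Return the text between the first matching prefix and the next terminator, or None (hypothetical helper)."""
    for prefix in prefixes:
        if line.startswith(prefix):
            start = len(prefix)  # replaces the hard-coded 5
            end = line.find(terminator, start)
            if end == -1:  # no terminator found: take the rest of the line
                end = len(line)
            return line[start:end].strip()
    return None

print(text_after_prefix("test: match this.", ["test:", "demo:"]))  # match this
print(text_after_prefix("demo: other", ["test:", "demo:"]))        # other
print(text_after_prefix("nothing here", ["test:", "demo:"]))       # None
```

Handling a missing terminator explicitly matters here: str.find returns -1 in that case, and slicing with -1 would silently drop the last character.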

How to create non-greedy regular expression from right?

I have a file named 'ab9c_xy8z_12a3.pdf' . I want to capture part after the last underscore and before '.pdf'.
Writing regular expression like :
s = 'ab9c_xy8z_12a3.pdf'
m = re.search(r'_.*?\.pdf',s)
m.group(0)
returns:
'_xy8z_12a3.pdf'
In this example, I would like to capture only '12a3' part. Thank you for your help.

The _.*?\.pdf regex matches the first underscore with _, then matches any 0+ chars other than a newline, as few as possible, but up to the leftmost occurrence of .pdf, which turns out to be at the end of the string. So, . matched all underscores on its way to .pdf, just because of the way a regex engine parses the string (from left to right) and due to . pattern.
You may fix the pattern by using a negated character class [^_] instead of . that will "subtract" underscores from . pattern.
([^_]+)\.pdf
and grab the Group 1 value.
Python demo:
import re
rx = r"([^_]+)\.pdf"
s = "ab9c_xy8z_12a3.pdf"
m = re.search(rx, s)
if m:
    print(m.group(1)) # => 12a3

Use re.split instead:
>>> re.split('[_.]', 'ab9c_xy8z_12a3.pdf')[-2]
'12a3'
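For comparison, here is a sketch of two more ways to take the part after the last underscore: one without regex, and one regex anchored at the end of the string. The helper name last_field is hypothetical, and the assumption that the extension should simply be dropped (whatever it is) belongs to this sketch, not to the question.

```python
import os
import re

def last_field(file_name):
    """Return the part of the stem after the last underscore (hypothetical helper)."""
    stem, _ext = os.path.splitext(file_name)  # 'ab9c_xy8z_12a3', '.pdf'
    # rpartition splits on the LAST underscore; with no underscore it returns the whole stem.
    return stem.rpartition("_")[2]

s = "ab9c_xy8z_12a3.pdf"
print(last_field(s))  # 12a3

# Regex alternative: underscore-free run that is directly followed by '.pdf' at the end.
m = re.search(r"[^_]+(?=\.pdf$)", s)
print(m.group(0))  # 12a3
```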
