to find the pattern using regex? - python

curP = "https://programmers.co.kr/learn/courses/4673'>#!Muzi#Muzi!)jayg07con&&"
I want to find the Muzi from this string with regex
for example
MuziMuzi : count 0 because it considers as one word
Muzi&Muzi: count 2 because it has & between so it separate the word
7Muzi7Muzi : count 2
I try to use the regex to find all matched
curP = "<a href='https://programmers.co.kr/learn/courses/4673'></a>#!Muzi#Muzi!)jayg07con&&"
pattern = re.compile('[^a-zA-Z]muzi[^a-zA-Z]')
print(pattern.findall(curP))
I expected the ['!muzi#','#Muzi!']
but the result is
['!muzi#']

You need to use this as your regex:
pattern = re.compile('[^a-zA-Z]muzi(?=[^a-zA-Z])', flags=re.IGNORECASE)
(?=[^a-zA-Z]) says that muzi must have a looahead of [^a-zA-Z] but does not consume any characters. So the first match is only matching !Muzi leaving the following # available to start the next match.
Your original regex was consuming !Muzi# leaving Muzi!, which would not match the regex.
Your matches will now be:
['!Muzi', '#Muzi']

As I understand it you want to get any value that may appear on both sides of your keyword Muzi.
That means that the #, in this case, has to be shared by both output values.
The only way to do it using regex is to manipulate the string as you find patterns.
Here is my solution:
import re
# Define the function to find the pattern
def find_pattern(curP):
pattern = re.compile('([^a-zA-Z]muzi[^a-zA-Z])', flags=re.IGNORECASE)
return pattern.findall(curP)[0]
curP = "<a href='https://programmers.co.kr/learn/courses/4673'></a>#!Muzi#Muzi!)jayg07con&&"
pattern_array = []
# Find the the first appearence of pattern on the string
pattern_array.append(find_pattern(curP))
# Remove the pattern found from the string
curP = curP.replace('Muzi','',1)
#Find the the second appearence of pattern on the string
pattern_array.append(find_pattern(curP))
print(pattern_array)
Output:
['!Muzi#', '#Muzi!']

Related

Python get N characters from string with specific substring

I have a very long string that I extracted from a image file.
The string can look like this
...\n\nDate: 01.01.2022\n\nArticle-no: 123456789\n\nArticle description: asdfqwer 1234...\n...
How do I extract just the 10 characters after the substring "Article-no:"?
I tried solving it with a different approach using rfind like this but it tends to fail every now and then if the start and end string is not accurate.
s = "... string shown above ..."
start = "Article-no: "
end = "Article description: "
print(s[s.find(start)+len(start):s.rfind(end)])
you can use split:
string.split("Article-no: ", 1)[1][0:10]
For this, a regular expression might come in very handy.
import re
# Create a pattern which matches "Article-no: " literally,
# and then grabs the digits that follow.
pattern = re.compile(r"Article-no: (\d+)")
s = "...\n\nDate: 01.01.2022\n\nArticle-no: 123456789\n\nArticle description: asdfqwer 1234...\n..."
match = pattern.search(s)
if match:
print(match.group(1))
This outputs:
123456789
The regular expression used is Article-no: (\d+), which has the following parts:
Article-no: # Match this text literally
( # Open a new group (i.e. group 1)
\d+ # Match 1 or more occurrences of a digit
) # Close group 1
The re module will search the string for places where this matches, and then you can extract the digit from the matches.

Python replace between two chars (no split function)

I currently investigate a problem that I want to replace something in a string.
For example. I have the following string:
'123.49, 19.30, 02\n'
I only want the first two numbers like '123.49, 19.30'. The split function is not possible, because a I have a lot of data and some with and some without the last number.
I tried something like this:
import re as regex
#result = regex.match(', (.*)\n', string)
result = re.search(', (.*)\\n', string)
print(result.group(1))
This is not working finde. Can someone help me?
Thanks in advance
You could do something like this:
reg=r'(\d+\.\d+), (\d+\.\d+).*'
if(re.search(reg, your_text)):
match = re.search(reg, your_text)
first_num = match.group(1)
second_num = match.group(2)
Alternatively, also adding the ^ sign at the beginning, making sure to always only take the first two.
import re
string = '123.49, 19.30, 02\n'
pattern = re.compile('^(\d*.?\d*), (\d*.?\d*)')
result = re.findall(pattern, string)
result
Output:
[('123.49', '19.30')]
In the code you are using import re as regex. If you do that, you would have to use regex.search instead or re.search.
But in this case you can just use re.
If you use , (.*) you would capture all after the first occurrence of , and you are not taking digits into account.
If you want the first 2 numbers as stated in the question '123.49, 19.30' separated by comma's you can match them without using capture groups:
\b\d+\.\d+,\s*\d+\.\d+\b
Or matching 1 or more repetitions preceded by a comma:
\b\d+\.\d+(?:,\s*\d+\.\d+)+\b
regex demo | Python demo
As re.search can also return None, you can first check if there is a result (no need to run re.search twice)
import re
regex = r"\b\d+\.\d+(?:,\s*\d+\.\d+)+\b"
s = "123.49, 19.30, 02"
match = re.search(regex, s)
if match:
print(match.group())
Output
123.49, 19.30

Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.
Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com
You need to use re.search since re.match tries to match from the beging of the string. To match until a space or period is encountered.
re.search(r'(?<=test :)[^.\s]*',text)
To match all the chars until a period is encountered,
re.search(r'(?<=test :)[^.]*',text)
In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will retrun match this.. See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match() since it will only look for a match at the beginning of the string (Avinash aleady pointed that out, but it is a very important note!)
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, it is awkward to design, and difficult to debug. There are definitely occassions to use it, but if you just want to extract the text between test: and ., then I don't think is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).

Regex: skip the first match of a character in group?

From this string
s = 'stringalading-0.26.0-1'
I'd like to extract the part 0.26.0-1. I can think of various ways to achieve this, using split or a regular expression using a pattern like this
pattern = r'\d+\.\d+\.\d+\-\d+'
I also tried to use a group of characters, like so:
pattern = r'[.\-\d]+'
This gives me:
In [30]: re.findall(pattern, s)
Out[30]: ['-0.26.0-1']
So I wondered: is it possible to skip the first occurrence of a character in a group, in this case the first occurrence of -?
is it possible to to skip the first occurrence of a character in a group, in this case the first occurrence of -?
NO, because when matching, the regex engine processes the string from left to right, and once the matching pattern is found, the matched chunk of text is written to the match buffer. Thus, either write a regex that only matches what you need, or post-process the found result by stripping unwanted characters from the left.
I think you do not need a regex here. You can split the string with - and pass the maxsplit argument set to 1, then just access the second item:
s = 'stringalading-0.26.0-1'
print(s.split("-", 1)[1]) # => '0.26.0-1'
See the Python demo
Also, your first regex works well:
import re
s = 'stringalading-0.26.0-1'
pat = r'\d+\.\d+\.\d+-\d+'
print(re.findall(pat, s)) # => ['0.26.0-1']
Do:
-(.*)
and get captured group 1.
Example:
In [9]: s = 'stringalading-0.26.0-1'
In [10]: re.search(r'-(.*)', s).group(1)
Out[10]: '0.26.0-1'

Find which part of a multiple regex gave a match

I have a multiple regex which combines thousands of different regexes e.g r"reg1|reg2|...".
I'd like to know which one of the regexes gave a match in re.search(r"reg1|reg2|...", text), and I cannot figure how to do it since `re.search(r"reg1|reg2|...", text).re.pattern gives the whole regex.
For example, if my regex is r"foo[0-9]|bar", my pattern "foo1", I'd like to get as an answer "foo[0-9].
Is there any way to do this ?
Wrap each sub-regexp in (). After the match, you can go through all the groups in the matcher (match.group(index)). The non-empty group will be the one that matched.
You could put each possible regex into a list, then checking them in series, as this would be faster than one very large regex, and allow you to figure out which matched as you need to:
mystring = "Some string you're searching in."
regs = ['reg1', 'reg2', 'reg3', ...]
matching_reg = None
for reg in regs:
match = re.search(reg, mystring)
if match:
matching_reg = reg
break
After that, match and matching_reg will both be None if no match was found. If a match was found, match will contain the regex result and matching_reg will contain the regex search string from regs that matched.
Note that break is used to stop attempting to match as soon as a match is found.

Categories